1 Introduction

The science of networks has revolutionised research into the dynamics of interact- 
ing elements. The associated techniques have had a huge impact in a range of fields, 
from computer science to neurology, from social science to statistical physics. However,
it could be argued that epidemiology has embraced the potential of network theory
more than any other discipline. There is an extremely close relationship between
epidemiology and network theory that dates back to the mid-1980s (Klovdahl, 1985;
May and Anderson, 1987). This is because the connections between individuals (or
groups of individuals) that allow an infectious disease to propagate naturally define a 
network, while the network that is generated provides insights into the epidemiological 
dynamics. In particular, an understanding of the structure of the transmission network 
allows us to improve predictions of the likely distribution of infection and the early 
growth of infection (following invasion), as well as allowing the simulation of the full 
dynamics. However, the interplay between networks and epidemiology goes further;
because the network defines potential transmission routes, knowledge of its structure 
can be used as part of disease control. For example, contact tracing aims to identify 
likely transmission network connections from known infected cases and hence treat or 
contain their contacts thereby reducing the spread of infection. Contact tracing is a 
highly effective public health measure as it uses the underlying transmission dynamics 
to target control efforts and does not rely on a detailed understanding of the etiology of 
the infection. It is clear therefore that the study of networks and how they relate to the 
propagation of infectious diseases is a vital tool to understanding disease spread and 
therefore informing disease control. 



Here we review the growing body of research concerning the spread of infectious 
diseases on networks, focusing on the interplay between network theory and epidemi- 
ology. The review is split into four main sections, which examine: the types of network 
relevant to epidemiology; the multitude of ways these networks can be characterised; 
the statistical methods that can be applied to either infer the likely network structure 
or the epidemiological parameters on a realised network; and finally simulation and 
analytical methods to determine epidemic dynamics on a given network. Given the
breadth of areas covered and the ever-expanding number of publications (over seven 
thousand papers have been published concerning infectious diseases and networks) a 
comprehensive review of all work is impossible. Instead, we provide a personalised 
overview into the areas of network epidemiology that have seen the greatest progress 
in recent years or have the greatest potential to provide novel insights. As such, considerable
importance is placed on analytical approaches and statistical methods, which are
both rapidly expanding fields. We note that a range of other network-based processes
(such as the spread of ideas or panic) can be modelled in a similar manner to the spread
of infection; however, in these contexts the transmission process is far less clear. Therefore,
throughout this review we restrict our attention to epidemiological issues.



2 Networks, Data and Simulations 

A wide range of network structures and types have been utilised when
considering the spread of infectious diseases. Here, we consider the most common 
forms and explain their uses and limitations. Later, we review the implications of these 
structures for the spread and control of infectious diseases. 

2.1 The Ideal Network 

We start our examination of network forms by considering the ideal network that would 
allow us to completely describe the spread of any infectious pathogen. Such a net- 
work would be derived from an omniscient knowledge of individual behaviour. We 
define $G_{i,j}(t)$ to be a time-varying, real, high-dimensional variable that informs about
the strength of all potential transmission routes from individual $i$ to individual $j$ at time
$t$. Any particular infectious disease can then be represented as a function ($f_{\text{pathogen}}$)
translating this high-dimensional variable into an instantaneous probabilistic transmission
rate (a single real variable). In this ideal, $G$ subsumes all possible transmission
networks, from sexual relations to close physical contact, face-to-face conversations,
or brief encounters, and quantifies the time-varying strength of this contact. The disease
function then picks out (and combines) those elements of $G$ that are relevant for
transmission of this pathogen, delivering a new (single-valued) time-varying infection-specific
matrix ($T_{i,j}(t) = f_{\text{pathogen}}(G_{i,j}(t))$). This infection-specific matrix then allows
us to define the stochastic dynamics of the infection process for a given pathogen. (For
even greater generality, we may want to let the pathogen-specific function $f$ also depend
on the time since an individual was infected, such that time-varying infectivity or
even time-varying transmission routes can be accommodated.)
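
The mapping from the ideal network to an infection-specific rate can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the route names ("physical", "conversation", "sexual"), the weights, and the choice of a weighted sum as the pathogen function are hypothetical and are not taken from any study discussed in this review.

```python
from typing import Callable, Dict

RouteStrengths = Dict[str, float]  # strength of each route at one instant


def make_pathogen_function(weights: Dict[str, float]) -> Callable[[RouteStrengths], float]:
    """Return a toy f_pathogen: a weighted sum over the routes the pathogen can use."""
    def f_pathogen(routes: RouteStrengths) -> float:
        return sum(weights.get(name, 0.0) * strength
                   for name, strength in routes.items())
    return f_pathogen


def transmission_rate(G, f_pathogen, i, j, t):
    """T_ij(t) = f_pathogen(G_ij(t)); pairs with no recorded contact give 0."""
    if (i, j) not in G:
        return 0.0
    return f_pathogen(G[(i, j)](t))


# Hypothetical G: individual 0 has constant physical contact with individual 1,
# and converses with them only during the first half of each day.
G = {
    (0, 1): lambda t: {
        "physical": 0.2,
        "conversation": 1.0 if t % 1.0 < 0.5 else 0.0,
    },
}

flu_like = make_pathogen_function({"conversation": 0.05, "physical": 0.1})
sti_like = make_pathogen_function({"sexual": 0.3})

daytime_rate = transmission_rate(G, flu_like, 0, 1, 0.25)
night_rate = transmission_rate(G, flu_like, 0, 1, 0.75)
```

The same $G$ yields different matrices for different pathogens: the flu-like function sees a rate that varies through the day, while the STI-like function sees no transmission route at all between these two individuals.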

Obviously, the reality of transmission networks is far from this ideal. Information 
on the potential transmission routes within a population tends to be limited in a num- 
ber of aspects. Firstly, it is rare to have information on the entire population; most 
networks rely on obtaining personal information on participants and therefore partic- 
ipation is often limited. Secondly, information is generally only recorded on a single 
transmission route (e.g. face-to-face conversation or sexual partnership) and often this 
is merely recorded as the presence or absence of a contact rather than attempting to 
quantify the strength or frequency of the interaction. Finally, data on contact networks 
are rarely dynamic; what is generally recorded is whether a contact was present during 
a particular period with little consideration given to how this pattern may change over
time. In the light of these departures from the ideal, it is important to consider the 
specifics of different networks that have been recorded or generated, and understand 
their structure, uses and limitations. 



2.2 Realised Encounter Networks 



One of the few examples of where many of the potential transmission routes within a 
population have been documented comes from the spread of sexually transmitted infec- 
tions (STIs). In contrast with airborne infections, STIs have very obvious transmission 
routes — sex acts (or sharing needles during intravenous drug use) — and as such these 
potential transmission routes should be easily remembered (Figure 1A). Generally the
methodology replicates that adopted during contact tracing: an individual is asked to
name all their sexual partners over a given period, these partners are then traced and
asked for their partners, and the process is repeated; this is known as snowball sampling
(Goodman, 1961) (Figure 1B). A related methodology is respondent-driven sampling,
where individuals are paid both for their participation and the participation of
their contacts while protecting each individual's anonymity (Heckathorn, 1997). This
approach, while suitable for hidden and hard-to-reach populations, has a number of
limitations, both practical and theoretical: recruiting people into the study, getting them to
disclose such highly personal information, imperfect recall from participants, the
inability to find all partners, and the clustering of contacts. In addition, there is the
theoretical issue that this algorithm will only find a single connected component within the
population, and it is quite likely that multiple disjoint networks exist (Jolly and Wylie,
2001).
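
The snowball procedure, and its inability to leave the component containing the seeds, can be sketched as a breadth-first trace limited to a fixed number of waves. This is a minimal illustration under idealised assumptions (perfect recall, every named partner is found and participates); the function name `snowball_sample` and the toy population are hypothetical.

```python
from collections import deque


def snowball_sample(contacts, seeds, waves):
    """Trace contacts outward from the seeds for a fixed number of waves.

    contacts: dict mapping each individual to the set of partners they name.
    Returns a dict mapping each traced individual to the wave in which they
    were first reached (0 for the seeds).
    """
    wave_of = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        person = queue.popleft()
        if wave_of[person] == waves:
            continue  # individuals in the final wave are not asked for partners
        for partner in contacts.get(person, ()):
            if partner not in wave_of:
                wave_of[partner] = wave_of[person] + 1
                queue.append(partner)
    return wave_of


# Hypothetical population with two disjoint components: {a, b, c, d} and {x, y}.
contacts = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
    "x": {"y"}, "y": {"x"},
}
two_waves = snowball_sample(contacts, seeds=["a"], waves=2)
exhaustive = snowball_sample(contacts, seeds=["a"], waves=10)
```

With two waves the sample stops at "c"; even an exhaustive trace from "a" never reaches "x" or "y", which is the single-component limitation noted above.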



Despite these problems, and motivated by the desire to better understand the spread
of HIV and other STIs, several pioneering studies were performed. Probably the earliest
is discussed by Klovdahl (1985) and utilises data collected by the Center for Disease
Control from 19 patients in California suffering from AIDS, leading to a network
of 40 individuals. Other larger-scale studies have been performed in Winnipeg,
Manitoba, Canada (Wylie and Jolly, 2001) and Colorado Springs, Colorado, U.S.A.
(Klovdahl et al., 1994). In both of these studies, participants were tested for STIs,
and the distribution of infection was compared to the underlying network structure. Work
done on both of these networks has generally focused on network properties and the
degree to which these can explain the observed cases; no attempt was made to use
these networks predictively in simulations. In addition, in the Colorado Springs study
tracing was generally only performed for a single iteration, although many initial
participants in high-risk groups were enrolled; while in the Manitoba study tracing was
performed as part of the routine information gathered by public health nurses. Therefore,
while both provide a vast amount of information on sexual contacts, it is not clear
if the results are truly a comprehensive picture of the network, and sampling biases may
corrupt the resulting network (Ghani et al., 1998). In addition, compared to the ideal
network, these sexual contact networks lack any form of temporal information; instead
they provide an integration of the network over a fixed time period, and generally lack
information on the potential strength of a contact between individuals. Despite these
difficulties, they continue to provide an invaluable source of information on human
sexual networks and the potential transmission routes of STIs. In particular they point
to the extreme levels of heterogeneity in the number of sexual contacts over a given
period, and the variance in the number of contacts has been shown to play a significant
role in early transmission dynamics (Anderson and May, 1992).
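
The role of the variance can be made concrete with a short calculation. The sketch below uses a standard mean-field approximation from the STI modelling literature, stated here as an assumption rather than a result derived in this review: under proportionate mixing, early growth scales with an effective contact number $c = m + \sigma^2/m$, where $m$ and $\sigma^2$ are the mean and variance of the number of partners. The two partner-count lists are invented for illustration.

```python
from statistics import mean, pvariance


def effective_contacts(partner_counts):
    """Effective contact number c = m + variance / m (mean-field approximation)."""
    m = mean(partner_counts)
    return m + pvariance(partner_counts) / m


# Two hypothetical populations with the same mean number of partners (2).
homogeneous = [2, 2, 2, 2]
heterogeneous = [1, 1, 1, 5]

c_homogeneous = effective_contacts(homogeneous)
c_heterogeneous = effective_contacts(heterogeneous)
```

Both populations have the same mean, but the heterogeneous one has a larger effective contact number, so an infection would be expected to grow faster in its early phase.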



One of the few early examples of the simulation of disease transmission on an observed
network comes from a study of a small network of 22 injection drug users and
their sexual partners (Bell et al., 1999) (Figure 1A). In this work the risk of transmission
between two individuals in the network was imputed based on the frequency and
types of risk behaviour connecting those two individuals. HIV transmission was modelled
using a monthly time-step and single index case, and simulations were run for
varying lengths of (simulated) time. This enabled a node's position in the network (as
characterised by a variety of measures) to be compared with how frequently it was infected
during simulations, and how many other nodes it was typically responsible for
infecting.
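
A simulation of this general kind can be sketched as follows. This is not the model of Bell et al.: it is a minimal susceptible-infected simulation with no recovery, in which the per-month transmission probabilities, the toy network and the function names are all hypothetical, and the index case is drawn uniformly at random.

```python
import random


def simulate_si(weights, index_case, months, rng):
    """Discrete-time SI epidemic on a weighted, undirected network.

    weights: dict mapping (u, v) to a per-month transmission probability.
    Returns a dict mapping each infected node to its infector (None for the
    index case).
    """
    infected_by = {index_case: None}
    for _ in range(months):
        new = {}
        for (u, v), p in weights.items():
            for src, dst in ((u, v), (v, u)):
                if src in infected_by and dst not in infected_by and dst not in new:
                    if rng.random() < p:
                        new[dst] = src
        infected_by.update(new)  # new infections transmit from the next month
    return infected_by


def infection_frequency(weights, nodes, months, runs, seed=0):
    """Tally how often each node is infected and how many cases it causes."""
    rng = random.Random(seed)
    times_infected = {n: 0 for n in nodes}
    secondary_cases = {n: 0 for n in nodes}
    for _ in range(runs):
        index_case = rng.choice(nodes)
        for node, source in simulate_si(weights, index_case, months, rng).items():
            if source is not None:
                times_infected[node] += 1
                secondary_cases[source] += 1
    return times_infected, secondary_cases


# Hypothetical network: a hub "h" with two certain-transmission links, and a
# pair "c"-"d" whose link carries no transmission risk.
nodes = ["h", "a", "b", "c", "d"]
weights = {("h", "a"): 1.0, ("h", "b"): 1.0, ("c", "d"): 0.0}
chain = simulate_si(weights, "a", months=2, rng=random.Random(1))
times_infected, secondary_cases = infection_frequency(weights, nodes, months=3, runs=50)
```

The two tallies correspond to the two quantities compared with network position in the text: how frequently a node is infected, and how many other nodes it infects.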



A different approach to gathering social network and behavioural data was initiated
by the Human Dynamics group at MIT and illustrates how modern technology
can assist in the process of determining transmission networks. One of the first
approaches was to take advantage of the fact that most people carry mobile phones
(Eagle and Pentland, 2006). In 2004, 100 Nokia 6600 smart-phones pre-installed with
software were given to MIT students to use over the course of the 2004-05 academic
year. Amongst other things, data were collected using Bluetooth to sense other mobile
phones in the vicinity. These data gave a highly detailed account of individuals'
behaviour and contact patterns. However, a limitation of this work was that Bluetooth
has a range of up to 25 meters, and as such networks inferred from these data may not
be epidemiologically meaningful.



A more recent study into the encounters between wild Tasmanian devils in the
Narawntapu National Park in northern Tasmania utilised a similar technological approach
(Hamede et al., 2009). In this work 46 Tasmanian devils were fitted with proximity
loggers that could detect and record the presence of other loggers within a 30cm
range. As such these loggers were able to provide detailed temporal information on
the potential interaction between these 46 animals. This study was initiated to understand
the spread of Tasmanian devil facial tumour disease, which causes usually-fatal
tumours that can be transmitted between devils if they fight and bite each other. Although
only 27 loggers with complete data were recovered, and although the methodology
only recorded interaction between the 46 devils in the study, the results were
highly informative (generating a network that was far from random, heterogeneous and
of detailed temporal resolution). Analyses based on the structure of this network suggested
that targeted measures that focus on the most highly connected ages or sex
were unlikely to curtail the spread of this infection. Of perhaps greater relevance is the
potential this method illustrates for determining the contact networks of other species
(including humans); the only limitation is the deployment of a suitable number
of proximity loggers.



2.3 Inferred Encounter Networks 

Given the huge logistical difficulties of capturing the full network of interactions be- 
tween individuals within a population, a variety of methods have been developed to 
generate synthetic networks from known attributes. Generally such methods fall into 
two classes: those that utilise egocentric information, and those that attempt to simulate
the behaviour of individuals. 



Egocentric data generally consist of information on a number of individuals (the
egos) and their contacts (the alters). As such the information gathered is very similar to
that collected in the sexual contact network studies in Manitoba and Colorado Springs,
but with only the initial step of the snowball sampling performed; the difference is
that for the majority of egocentric data the identity of partners (alters) is unknown and
therefore connections between egos cannot be inferred (Figure 1C). The data therefore
exist as multiple independent 'stars' linking the egos to the alters, which in itself provides
valuable information on heterogeneities within the network. Two major studies
have attempted to gather such egocentric information: the NATSAL studies of sexual
contacts in the UK (Johnson et al., 1992, 1994, 2001; Copas et al., 2002), and the
POLYMOD study of social interactions within 8 European countries (Mossong et al.,
2008). The key to generating a network from such data is to probabilistically assign
each alter a set of contacts drawn from the information available from egos; in essence,
using the ego data to perform the next step in the snowball sampling algorithm. The
simplest way to do this is to generate multiple copies of all the egos and to consider
the contacts from each ego to be "half-links"; the half-links within the network can
then be connected at random, generating a configuration network (Molloy and Reed,
1995, 1998; Read et al., 2008). If more information is available on the status (age, gender,
etc.) of the egos and alters then this can also be included and will reduce the
set of half-links that can be joined together. However, in the vast majority of modelling
studies, the egocentric data have simply been used to construct WAIFW
(who-acquires-infection-from-whom) matrices (Johnson et al., 2001; Mossong et al., 2008;
Baguelin et al., 2010) that inform about the relative levels of transmission between different
groups (e.g. based on sexual activity or age) but neglect the implicit network
properties. This matrix-based approach is often reliable: for STIs it is the extreme
heterogeneity in the number of contacts (which are close to being power-law or scale-free
distributed, see section 3.2) that drives the infection dynamics (Liljeros et al., 2001),
although larger-scale structure does play a role (Ghani and Garnett, 2000); for social
interactions it is the assortativity between (age-) groups that controls the behaviour, with
the number of contacts being distributed as a negative binomial (Mossong et al., 2008).
The POLYMOD matrices have therefore been extensively used in the study of the
H1N1 pandemic in 2009, providing important information about the cost-effective
vaccination of different age-classes (Medlock and Galvani, 2009; Baguelin et al., 2010).
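
The half-link construction can be sketched in a few lines. This is a minimal version that assumes only the reported number of contacts per (replicated) ego is used; the function name and the example degrees are hypothetical, and self-loops and repeated links are kept, whereas many studies discard or rewire them.

```python
import random


def configuration_network(degrees, rng):
    """Join half-links ("stubs") uniformly at random.

    degrees: dict mapping node -> number of reported contacts.
    Returns a list of edges (u, v); self-loops and repeated edges may occur.
    """
    stubs = [node for node, k in degrees.items() for _ in range(k)]
    if len(stubs) % 2:
        raise ValueError("total number of half-links must be even")
    rng.shuffle(stubs)
    # Pair consecutive stubs in the shuffled list.
    return list(zip(stubs[0::2], stubs[1::2]))


def realised_degrees(edges):
    """Count link ends per node (a self-loop contributes two)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


degrees = {"a": 3, "b": 2, "c": 2, "d": 1}
edges = configuration_network(degrees, random.Random(0))
```

By construction every node keeps exactly its prescribed degree, while all other structure is random. Restricting which half-links may be joined (for example by age or gender) would amount to shuffling within compatible groups rather than over the whole list.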



A) Contacts between 22 intravenous drug users, as
recorded in Bell et al. (1999); squares refer to primary contacts.
Given that the identity of contacts is known they can be
interlinked.

B) Caricature of a snowball sampling algorithm; squares
are primary contacts, diamonds are secondary and
circles are tertiary contacts. Given that the identity of
contacts is known they can be linked.

C) Example of a configuration model network. Each
individual has a prescribed degree distribution, which
gives rise to 'half-links' that are connected at random.

D) A household configuration network, consisting of
completely interconnected households (cliques) with
each individual also having one random link to another
household.

E) Example of a small-world model based on a 2-D lattice
with nearest neighbour connections. The small-world
property is given by the presence of rare random links
that can connect distant parts of the network.

F) Map showing Great Britain, together with the movements of cattle
from six farms (each represented in a separate colour). Notice the
heterogeneity between farms, and the generally localised nature of
movements.

Figure 1: Examples of networks used in epidemiology

The general configuration model approach of randomly linking together "half-links"
from each ego (Molloy and Reed, 1995, 1998) has been adopted and modified
to consider the spread of STIs. In particular simulations have been used to consider
the importance of concurrency in sexual networks (Kretzschmar and Morris, 1996;
Morris and Kretzschmar, 1997), where concurrency is defined as being in two active
sexual partnerships at the same time. A dynamic sexual network was simulated,
with partnerships being broken and reformed such that the network density remained
constant over time. The likelihood of two nodes forming a partnership depended
on their degree, but this relationship could be tuned to make concurrency more or
less common, and to make the mixing assortative or disassortative based on the degrees
of the two nodes. Transmission of an STI (such as gonorrhoea and chlamydia
(Kretzschmar and Morris, 1996) or HIV (Morris and Kretzschmar, 1997)) was then
simulated upon this dynamic network, showing that increasing concurrency substantially
increased the growth rate during the early phase of an epidemic (and, therefore,
its size after a given period of time). This greater growth rate was related to the increase
in giant component size (see section 3.1) that was caused by increased concurrency.
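
The idea of a constant-density partnership network with tunable concurrency can be illustrated with a toy turnover rule. This is not the published model: the rejection rule, the parameter `concurrency_penalty` and all names below are assumptions introduced purely for illustration, and the sketch covers only the partnership dynamics, not infection.

```python
import random


def turnover_step(partnerships, nodes, concurrency_penalty, rng, max_tries=10_000):
    """Dissolve one partnership and form a new one, so density stays constant.

    partnerships: set of (u, v) tuples with u < v; nodes: list of node labels.
    concurrency_penalty: probability of rejecting a proposed pair when either
    member already has a partner (1.0 forbids new concurrent partnerships,
    0.0 ignores current degree).
    """
    partnerships.remove(rng.choice(sorted(partnerships)))
    partnered = {n for pair in partnerships for n in pair}
    for _ in range(max_tries):
        u, v = sorted(rng.sample(nodes, 2))
        if (u, v) in partnerships:
            continue
        would_be_concurrent = u in partnered or v in partnered
        if would_be_concurrent and rng.random() < concurrency_penalty:
            continue
        partnerships.add((u, v))
        return
    raise RuntimeError("no acceptable partnership found")


def max_degree(partnerships):
    """Largest number of simultaneous partners held by any node."""
    degree = {}
    for pair in partnerships:
        for n in pair:
            degree[n] = degree.get(n, 0) + 1
    return max(degree.values())


rng = random.Random(42)
nodes = list(range(20))
serial = {(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)}
for _ in range(200):
    turnover_step(serial, nodes, concurrency_penalty=1.0, rng=rng)
```

Starting from a monogamous state with a penalty of 1.0, no node ever holds two partners at once, while the number of partnerships stays fixed. Lowering the penalty allows concurrent partnerships to appear at the same overall density, which is the contrast the simulations described above were designed to explore.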

A slightly more general approach to the generation of model sexual networks was
employed by Ghani et al. (1997). In their network model, individuals had a preferred
number of concurrent partners and duration of partnerships, and their level of assortativity
was tunable. A gonorrhoea-like infection was simulated on the resulting dynamic
network. Regression models were used to consider the association between network
structures (either snapshots of the state of the network at the end of simulation, or
accumulated over the last 90 days of simulation) and prevalence of infection. These
simulations showed that increasing levels of concurrent partnerships made invasion of the
network more likely, and also that the mixing patterns of the most sexually active nodes
were most important in determining the final prevalence of infection within the population
(Ghani et al., 1997). The same model was later used to consider the importance of
different structural measures and sampling strategies, showing that it was important to
endeavour to identify infected individuals with a high number of sexual partners in order
to correctly define the high-risk group for interventions (Ghani and Garnett, 2000).

The alternative approach of simulating the behaviour of individuals is obviously
highly complex and fraught with a great deal of uncertainty. Despite these problems,
three groups have attempted just such an approach: Longini's group at Emory
(Halloran et al., 2002; Longini et al., 2005; Germann et al., 2006; Longini Jr. et al., 2007),
Ferguson's group at Imperial (Ferguson et al., 2005, 2006) and Eubank's group at Los
Alamos / Virginia Tech (Chowell et al., 2003; Eubank et al., 2004). The models of
both Longini and Ferguson are primarily agent-based models, where individuals are
assigned a home and work location within which they have frequent infection-relevant
contacts together with more random transmission in their local neighbourhood. The
Longini models separate the entire population into sub-units of 2000 individuals (for
the USA) or 13000 individuals (for South-East Asia) who constitute the local population
where random transmission can operate; in contrast the Ferguson models assign
each individual a spatial location and random transmission occurs via a spatial kernel.
In principle, both of these models could be used to generate an explicit network
model of possible contacts. The Eubank model is also agent-based, aiming to capture
the movements of 1.5 million people in Portland, Oregon, USA; but these movements
are then used to define a network based on whether two individuals occur in the same
place (there are 180 thousand places represented in the model) at the same time. It is
this network that is then used to simulate the spread of infection. While in principle
this Eubank model could be used to define a temporally varying and real-valued network
(where the strength of connection would be related to the type of mixing in a
location and the number of people in the location), in the epidemiological publications
(Eubank et al., 2004) the network is considered as a static contact network in which
extreme heterogeneity in numbers of contacts is again predicted and the network has
'small world' like properties (see below). A similar approach of generating artificial
networks of individuals for stochastic simulations of respiratory disease has been recently
applied to influenza at the scale of the United States, and the software made
generally available (Chao et al., 2010). This software took a more realistic dynamic
network approach and incorporated flight data within the United States, but was sufficiently
resource-intensive to require specialist computing facilities (a single simulation
taking around 192 hours of CPU time). All three models have been used to consider
optimal control strategies, determining the best deployment of resources in terms of
limiting transmission associated with different routes. The predicted success of various
control strategies therefore critically depends on the strength of contacts within the
home, at work, within social groups, and that occur at random.
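
The step from simulated movements to a contact network, as in the Eubank approach, can be sketched by linking individuals who occupy the same place at overlapping times. The visit records, the names and the choice of total overlap duration as the edge weight are assumptions made for this illustration, not details of the published model.

```python
from collections import defaultdict
from itertools import combinations


def colocation_network(visits):
    """Build a static weighted contact network from visit records.

    visits: iterable of (person, place, start, end) tuples.
    Two people are linked if they are in the same place at overlapping times;
    the edge weight is their total overlap duration.
    """
    by_place = defaultdict(list)
    for person, place, start, end in visits:
        by_place[place].append((person, start, end))

    weight = defaultdict(float)
    for occupants in by_place.values():
        for (p, s1, e1), (q, s2, e2) in combinations(occupants, 2):
            overlap = min(e1, e2) - max(s1, s2)
            if p != q and overlap > 0:
                weight[tuple(sorted((p, q)))] += overlap
    return dict(weight)


# Hypothetical day: times are hours since midnight.
visits = [
    ("ann", "home", 0, 8), ("bob", "home", 0, 8),
    ("ann", "office", 9, 17), ("cat", "office", 12, 18),
    ("bob", "cafe", 9, 10), ("cat", "cafe", 11, 12),
]
network = colocation_network(visits)
```

Here "bob" and "cat" both visit the cafe but at different times, so no link is created between them. Discarding the weights and keeping only the presence of a link gives the static contact network used in the epidemiological publications; retaining the timing would give the temporally varying network mentioned above.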

Whilst smallpox has been eradicated, concern remains about the possibility of a deliberate
release of the disease. The stochastic simulation models of the Longini group
have predominantly focused on methods of controlling this infection (Halloran et al.,
2002; Longini Jr. et al., 2007). Their early work utilised networks of two thousand
people with realistic age, household size and school attendance distributions, with the
likelihood of each individual becoming infected being derived from the number and
type of contacts with infectious individuals (Halloran et al., 2002). This paper focused
on the use of vaccination to contain a small-scale outbreak of smallpox, and
concluded that early mass-vaccination of the entire population was more effective than
targeted vaccination if there was little or no immunity in the population. Later models
(Longini Jr. et al., 2007) combined these sub-networks of two thousand people into a
larger network of fifty thousand people (with one hospital), and the adult population
were able to contact each other through workplaces and high schools. Here the focus
was on surveillance and containment, which were generally concluded to be sufficient
to control an outbreak. The epidemiological work of the Eubank group has also focused
on a release of smallpox, although these simulations showed that encouraging
people to stay at home as soon as they began to feel unwell was more important than
choice of vaccination protocol (Eubank et al., 2004); this may in part be attributed to
the scale-free structure of the network and hence the super-spreading nature of some
individuals.

The Ferguson models have primarily been used to consider the spread and control
of pandemic influenza, examining its potential spread from an initial source in South-East
Asia (Ferguson et al., 2005), and its spread in mainland USA and Great Britain
(Ferguson et al., 2006). The models of South-East Asia were primarily based on Thailand,
and included demographic information and satellite-based spatial measures of
population density. This work focused on containment by the targeted use of antiviral drugs
and suggested that as long as the reproductive ratio ($R_0$) of a novel strain was below
1.8 it could be contained by the rapid use of targeted antivirals and social distancing.
However, such a strategy could require a stockpile of around 3 million antiviral doses.
The models based on the USA and Great Britain considered a wider range of control
measures, including school-closures, household prophylaxis using antiviral drugs, and
vaccination, and predicted the likely impact of different policies.



2.4 Movement Networks 



An alternative source of network information comes from the recorded movements of
individuals. Such data frequently describe a relatively large network as information
on movements is often collected by national or international bodies. The network of
movements therefore has nodes representing locations (rather than individuals) and
edges weighted to capture the number of movements from one location to another;
as such the network is rarely symmetric. Four main forms of movement network have
played important roles in understanding the spread of infectious diseases: the airline
transportation network (Hufnagel et al., 2004; Guimera et al., 2005), the movement of
individuals to and from work (Hall et al., 2007; Viboud et al., 2006a), the movement
of dollar bills (from which the movement of people can be inferred) (Brockmann et al.,
2006), and the movement of livestock (especially cattle) (Green et al., 2006; Robinson et al.,
2007). While the structure of these networks has been analysed in some detail, to develop
an epidemiological model requires a fundamental assumption about how the epidemic
progresses within each location. All the examples considered in this section
make the simplifying assumption that the epidemic dynamics within each location are
defined by random (mean-field) interactions, with the network only informing about
the flow of individuals or just simply the flow of infection between populations;
such a formulation is known as a metapopulation model (Hanski and Gaggiotti, 2004).
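
A metapopulation formulation of this kind can be sketched as a deterministic SIR model in each location, coupled by per-capita movement rates. This is a generic illustration rather than any of the cited models: the Euler time-stepping, the frequency-dependent mixing term and all parameter values are assumptions, and `moves[a][b]` is a hypothetical asymmetric matrix of movement rates from location a to location b.

```python
def metapopulation_step(state, moves, beta, gamma, dt):
    """One Euler step of a deterministic SIR metapopulation model.

    state: list of [S, I, R] per location.
    moves: moves[a][b] is the per-capita movement rate from a to b.
    Mixing within each location is mean-field (frequency-dependent).
    """
    n = len(state)
    new_state = []
    for a in range(n):
        S, I, R = state[a]
        N = S + I + R
        infection = beta * S * I / N if N > 0 else 0.0
        deriv = [-infection, infection - gamma * I, gamma * I]
        for b in range(n):
            if b != a:
                for k in range(3):
                    # arrivals from b minus departures to b, for each class
                    deriv[k] += moves[b][a] * state[b][k] - moves[a][b] * state[a][k]
        new_state.append([x + dt * d for x, d in zip(state[a], deriv)])
    return new_state


def run(state, moves, beta, gamma, dt, steps):
    """Apply metapopulation_step repeatedly and return the final state."""
    for _ in range(steps):
        state = metapopulation_step(state, moves, beta, gamma, dt)
    return state


# Two hypothetical locations; infection starts only in the first.
start = [[990.0, 10.0, 0.0], [1000.0, 0.0, 0.0]]
coupled = run(start, [[0.0, 0.01], [0.01, 0.0]], beta=0.5, gamma=0.2, dt=0.1, steps=200)
isolated = run(start, [[0.0, 0.0], [0.0, 0.0]], beta=0.5, gamma=0.2, dt=0.1, steps=200)
```

With movement switched on, infection reaches the second location; with movement switched off it never does, which makes explicit that in this formulation the network only carries individuals (and hence infection) between otherwise well-mixed populations.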

Probably the earliest work using detailed movement data to drive simulations comes
from the spread of 1918 pandemic influenza in the Canadian Subarctic, based on
records kept by the Hudson's Bay Company (Sattenspiel and Herring, 1998). A conventional
SIR metapopulation model was combined with a network model (the nodes
being three fur trading posts in the region: God's Lake, Norway House, and Oxford
House) where some individuals remained in their home locations whilst others moved
between locations, based on records of arrivals and departures recorded in the post
journals. Whilst this model described only a small population, it could be parameterised
in considerable detail due to the quality of demographic and historical data
available, and showed that the movement patterns observed interacted with the starting
location of a simulated epidemic to change the relative timings of the epidemics in the
three communities, but not the overall impact of the disease.



The movement of passenger aircraft as collated by the International Air Transport
Association (IATA) provides very useful information about the long-distance movement
of individuals and hence how rapidly infection is likely to travel around the globe
(Hufnagel et al., 2004; Colizza et al., 2006, 2007). Unlike many other network models,
which are stochastic individual-level simulations, the work of Hufnagel et al. (2004)
and Colizza et al. (2006) was based on stochastic Langevin equations (effectively differential
equations with noise included). The early work by Hufnagel et al. (2004)
focused on the spread of SARS, and showed a remarkable degree of similarity between
predictions and the global spread of this disease. This work also showed that
extreme sensitivity to initial conditions arises from the structure of the network, with
outbreaks starting in different locations generating very different spatial distributions
of infection. The work of Colizza was more focused towards the spread of H5N1
pandemic influenza arising in South-East Asia, and its potential containment using antiviral
drugs. However, it was H1N1 influenza from Mexico that initiated the 2009 pandemic,
but again the IATA flight data provided a useful prediction of the early spread
(Khan et al., 2009; Balcan et al., 2009b). While such global movement networks are
obviously highly important in understanding the early spread of pathogens, they unfortunately
neglect more localised movements (Viboud et al., 2006b) and individual-level
transmission networks. However, recent work has aimed to overcome this first issue by
including other forms of local movement between populations (Viboud et al., 2006a;
Balcan et al., 2009a). This work has again focused on the spread of influenza, mixing
long-distance air travel with shorter range commuter movements, with the model
predictions by Viboud et al. (2006a) showing good agreement with the observed patterns
of seasonal influenza. An alternative form of movement network has been inferred
from the "Where's George" study of the circulation of dollar bills in the USA
(Brockmann et al., 2006); this provided far more information about short-range movements,
but again did not really inform about the interaction of individuals.



A wide variety (and in practice the vast majority) of movements are not made
by aircraft, but are regular commuter movements to and from work. The network
of such movements has also been studied in some detail for both the UK and USA
(Viboud et al., 2006a; Hall et al., 2007; Danon et al., 2009). The approaches adopted
parallel the work done using the network of passenger aircraft, but operate at a much
smaller scale, and again influenza and smallpox have been the considered pathogens.
As with the aircraft network certain locations act as major hubs attracting lots of commuters
every day; however, unlike the aircraft network there is the tendency for the
network to have a strong daily signature with commuters moving to work during the
day but travelling home again in the evening (Keeling et al., 2010). As such the commuter
network can be thought of as heterogeneous, locally-clustered, temporal and
with each contact having different strengths (according to the number of commuters
making each journey); however, to provide a complete description of population movement
and hence disease transmission requires other causes of movement to be included
(Danon et al., 2009) and requires strong assumptions to be made about individual-level
interactions. The key question that can be readily addressed from these commuter-movement
models is whether a localised outbreak can be contained within a region or
whether it is likely to spread to other nodes on the network (Hall et al., 2007).



Undoubtedly one of the largest and most comprehensive data-sets of movements 
between locations comes from the livestock tracing schemes run in Great Britain, and 
being adopted in other European countries. The Cattle Tracing Scheme in particular is 
spectacularly detailed, containing information of the movements of all cattle between 
farms in Great Britain; as such this scheme generates daily networks of contacts
between over 30,000 working farms in Great Britain (Green et al., 2006; Robinson et al.,
2007; Heath et al., 2008; Vernon and Keeling, 2009; Brooks-Pollock and Keeling, 2009)
(Figure 1F). Similar data also exist for the movement of batches of sheep and pigs
(Kiss et al., 2006b), although here the identity of individual animals making each
movement is not recorded. This data source has several key advantages over other
movement networks: it is dynamic, in that movements are recorded daily; the movement of
livestock is one of the major mechanisms by which many infections are transferred
between farms; and the metapopulation assumption that cattle mix homogeneously within
a farm is highly plausible. In principle, the information in the Cattle Tracing Scheme
can be used to form an even more comprehensive network, treating each cow as a node
and creating an edge if two cows occur within the same farm on the same day; this
would generate an individual-level network for each day which can then be used to
simulate the spread of infection (Keeling et al., 2010b).



The early spread of foot and mouth disease (FMD) in 2001 was primarily due to
livestock movements, particularly of sheep (Gibbens et al., 2001). Motivated by this
epidemic, Kiss et al. (2006b) conducted short simulated outbreaks of FMD on both
the sheep movement network based on 4 weeks' movements starting on 8 September
2004, and simulated synthetic networks with the same degree distribution. Due to the
short time-scales considered (the aim being to model spread of FMD before it had
been detected), nodes were susceptible, exposed or infected but never recovered, and
network connections remained static. Simulated epidemics were smaller on the sheep
movement network than the random networks, most likely due to disassortative
mixing in the sheep movement network. Similarly, Natale et al. (2009) employed a static
network simulation of Italian cattle farms. Here farms were not merely represented
as nodes, but a deterministic SI system of ODEs was used to model infection on each 
node essentially generating a metapopulation model. The only stochastic part of the 
model was the number of infectious individuals moved between connected farms in 
each time step. This simulation model highlighted the impact of the centrality of seed 
nodes (measured in several different ways) upon the subsequent epidemics' course. 

The use of static networks to model the very dynamic movement of livestock
is questionable. Expanding on earlier work, Green et al. (2006) simulated the early
spread of FMD through movement of cattle, sheep, and pigs. Here the livestock
network was treated dynamically, with infection only able to propagate along edges on
the day when that edge occurred; additional to this network spread, local
transmission could also occur. These simulations enabled regional patterns of risk to a new
FMD incursion to be assessed, as well as identifying markets as suitable targets for
enhanced surveillance. Vernon and Keeling (2009) considered the relationship between
epidemics predicted from dynamic cattle networks and their static counterparts in more
detail. They compared different network representations of cattle movement in the UK
in 2004, simulating epidemics across a range of infectivity and infectious period
parameters on the different network representations. They concluded that network
representations other than the fully dynamic one (where the movement network changes
every day) fail to reproduce the dynamics of simulated epidemics on the fully dynamic 
network. 
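
The difference between the two representations can be illustrated with a minimal sketch. It is not the simulation used in the studies cited above: the movement records are hypothetical, transmission along a movement is assumed to be certain, farms never recover, and infection acquired on a given day is assumed not to be passed on until the following day.

```python
from collections import defaultdict

def reachable_dynamic(movements, seed):
    """Farms reachable from `seed` when a movement can only transmit
    on the day it occurs. movements: iterable of (day, source, dest)."""
    infected = {seed}
    by_day = defaultdict(list)
    for day, src, dst in movements:
        by_day[day].append((src, dst))
    for day in sorted(by_day):
        # Only farms infected before today can pass infection on today.
        newly = {dst for src, dst in by_day[day] if src in infected}
        infected |= newly
    return infected

def reachable_static(movements, seed):
    """Farms reachable from `seed` when all movements are collapsed
    into a single static directed network (timing ignored)."""
    out = defaultdict(set)
    for _, src, dst in movements:
        out[src].add(dst)
    seen, stack = {seed}, [seed]
    while stack:
        node = stack.pop()
        for nxt in out[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen

# Hypothetical records: B sends animals to C before B itself receives from A.
movements = [(1, "B", "C"), (2, "A", "B"), (3, "B", "D")]
print(sorted(reachable_dynamic(movements, "A")))  # ['A', 'B', 'D']
print(sorted(reachable_static(movements, "A")))   # ['A', 'B', 'C', 'D']
```

In this toy example the static network places farm C at risk even though the only movement into C happened before its source could have been infected, which is the kind of discrepancy a fully dynamic representation avoids.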



2.5 Contact Tracing Networks 



Contact tracing and hence the networks generated by this method can take two distinct 
forms. The first is when contact-tracing is used to initiate pro-active control. This is 
often the case for STIs where identified cases are asked about their recent sexual part- 
ners, and these individuals are traced and tested; if found to be infected, then contact 
tracing is repeated for these secondary cases. Such a process is related to the snowball 
sampling that was discussed earlier, with the notable exception that tracing is only per- 
formed from known cases. Similar contact-tracing may operate for the early stages of 
an airborne epidemic (as was seen for the 2009 H1N1 pandemic), but here the tracing 
is not generally iterative as contacts are generally traced and treated so rapidly that they 
are unlikely to have generated secondary cases. An alternative form of contact-tracing
is when a transmission pathway is sought between all identified cases (Klovdahl, 1985;
Haydon et al., 2003; Riley et al., 2003). This form of contact tracing is likely to
become of ever-increasing importance in the future when improved molecular techniques
and statistical inference allow infection trees to be determined from genetic differences
between samples of the infecting pathogen (Cottam et al., 2008).



These forms of network have two main advantages, but one major disadvantage. 
The network is often accompanied by test results for the individuals within the
network; as such we not only have information on the contact process but also on the
resultant transmission of infection. In addition, when contact tracing is only performed 
to define an infection tree, there is the added advantage that the infection process itself 
defines the network of contacts and hence there is no need for human interpretation of 
which forms of contact may be relevant. Unfortunately, the reliance on the infection 
process to drive the tracing means that the network only reflects one realisation of the 
epidemic process and therefore may ignore contacts that are of potential importance
and would be needed if the epidemic was to be simulated; therefore while they can
inform about past outbreaks they have little predictive power. 



2.6 Surrogate Networks 



Obtaining large-scale and reliable information on who contacts whom is obviously 
very difficult, therefore there is a temptation to rely on alternative data sets where 
network information can be extracted far more easily and where the data is already 
collected. As such the movement networks and contact tracing networks discussed 
above are examples of such surrogate networks, although their connection to the
physical processes of infection transmission is far more clear. Other examples of networks
abound (Liljeros et al., 2001; Newman, 2003; Boccaletti et al., 2006; Newman et al.,
2006); and while these are not directly relevant for the spread of infection they do
provide insights into how networks form and grow: structures that are commonly seen
in surrogate networks are likely to arise in the types of network associated with disease
transmission. One source of network information that would be fantastically rich, and
also highly informative (if not immediately relevant) is the network of friendships and
contacts on social networking sites (such as Facebook); some sites have made data on
their social networks available, and these data have been used to examine a range of
sociological questions about online interactions (boyd and Ellison, 2007).



2.7 Theoretical Constructs 

Given the huge complexity involved in obtaining large scale and reliable data on real 
transmission networks many researchers have instead relied on theoretically constructed 
networks. These networks are usually highly simplified but aim to capture some of the 
known (or postulated) features of real transmission networks — often the simplifica- 
tions are so extreme that some analytical traction can be gained. Here we briefly outline 
some of the commonly used theoretical networks and identify which features they cap- 
ture; some of the results of how infection spreads on such networks are discussed more 
fully in section 4.2.



2.7.1 Configuration Networks 

One of the simplest forms of network is to allow each individual to have a set of con- 
tacts that it wishes to make (in more formal language each node has a set of half-links), 
these contacts are then made at random with other individuals based on the number of
contacts that they wish to make (half-links are randomly connected) (Molloy and Reed,
1998). This obviously creates a network of contacts (Figure 1C). The advantage of
these configuration networks is that because they are formed from many randomly
connected individuals there are no short loops within the network and a range of
theoretical results can be proved, ranging from conditions for invasion (Fisher and Essam, 1961;
Nickel and Wilkinson, 1983; Molloy and Reed, 1995) to descriptions of the temporal
dynamics (Ball and Neal, 2008). Unfortunately, the elements that make these networks
amenable to theoretical analysis — the lack of assortativity, short loops or clustering 
— are precisely factors that are thought to be important features of real networks. 
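
The half-link construction can be sketched in a few lines. This is an illustrative implementation under simple assumptions: the function name and the example degree sequence are hypothetical, and self-loops or repeated edges produced by the random pairing are kept rather than rejected.

```python
import random

def configuration_network(degrees, seed=None):
    """Pair half-links (stubs) uniformly at random.

    degrees[i] is the number of contacts node i wishes to make.
    Returns a list of edges; self-loops and repeated edges are kept,
    as in the simplest form of the construction.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the total number of half-links must be even")
    rng = random.Random(seed)
    # One entry per half-link, labelled by the node that owns it.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list are joined to form edges.
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

degrees = [3, 2, 2, 1, 1, 1]          # hypothetical degree sequence
edges = configuration_network(degrees, seed=1)

# Every node ends up with exactly the number of half-links it started with.
realised = [0] * len(degrees)
for a, b in edges:
    realised[a] += 1
    realised[b] += 1
print(realised == degrees)  # True
```

For large networks with bounded degrees the self-loops and repeated edges become rare, which is one reason the construction is usually analysed in that limit.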



An alternative formulation that offers a compromise between tractability and
realism occurs when individuals that exist in fully interconnected cliques have randomly
assigned links within the entire population (Ball and Neal, 2008; House and Keeling,
2008) (Figure 1D). As such these networks mimic the strong interactions within
families and the weaker contacts between them. While such models offer a significant
improvement over configuration networks, and capture the known importance of the
household in transmission, they make no allowance for clustering between households
due to spatial proximity. Hierarchical metapopulation models (Watts et al., 2005)
allow for this form of additional structure, where households (or other groupings) are
themselves grouped in an ascending hierarchy of clustering. 



2.7.2 Lattices and Small Worlds 



Both lattice networks and small world networks begin with the same formulation: in- 
dividuals are regularly spaced on a grid (usually in just one or two dimensions), and 
each individual is connected to their k nearest neighbours — these connections define 
a lattice. The advantage of such networks is that they retain many elements of the ini- 
tial spatial arrangement of points, and hence contain both many short loops as well as 
the property that infection tends to spread locally. There is a clear link between such
lattice-based networks and the field of probabilistic cellular automata (Lebowitz et al.,
1990; Rhodes and Anderson, 1997). The fundamental difficulty with such lattice
models is that the presence of short loops and localised spread means that it is difficult (if
not impossible) to prove exact results and hence large-scale multiple simulations are 
required. 



Small world networks improve upon the rigid structure of the lattice by allowing a 
low number of random contacts across the entire space (Figure 1E). Such long range
contacts allow infection to spread rapidly through the population and vastly reduce the
shortest path-length between individuals (Watts and Strogatz, 1998); this is
popularly known as six degrees of separation, from the concept that any two individuals
on the planet are linked through at most six friends or contacts (Travers and Milgram,
1969). Therefore small world networks offer a step towards reality, capturing the
local nature of transmission and the potential for long-range contacts (Boots and Sasaki,
1999; Boots et al., 2004); however, they suffer from neglecting heterogeneity in the
number of contacts and the tight clustering of contacts within households or social
settings.
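
The effect of a few long-range contacts on path length can be checked numerically. The sketch below is one possible construction, assumed for illustration: a one-dimensional ring lattice with shortcuts added between randomly chosen pairs (rather than rewired), with hypothetical sizes of 200 nodes, k = 4 and 10 shortcuts.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Each of n nodes on a ring is linked to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def add_shortcuts(adj, m, seed=None):
    """Return a copy of adj with m extra links between random unlinked pairs."""
    rng = random.Random(seed)
    new = {i: set(nbrs) for i, nbrs in adj.items()}
    nodes = list(new)
    added = 0
    while added < m:
        a, b = rng.sample(nodes, 2)
        if b not in new[a]:
            new[a].add(b)
            new[b].add(a)
            added += 1
    return new

def mean_path_length(adj):
    """Mean shortest-path length over pairs of distinct nodes (connected graph assumed)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

lattice = ring_lattice(200, 4)
small_world = add_shortcuts(lattice, 10, seed=0)
print(mean_path_length(lattice), mean_path_length(small_world))
```

Because the lattice links are all retained, local clustering is largely preserved while the mean path length falls, which is the combination of properties described above.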



2.7.3 Spatial Networks 

Spatial networks, as the name suggests, are generated using the spatial location of all 
individuals in the population; as such lattices and small worlds are a particular form 
of spatial network. The general methodology initially positions each individual i at a 
specific location $x_i$; usually these locations are chosen at random but clustered spatial
distributions have also been used (Badham et al., 2008). Two individuals (say i and j)



are then probabilistically connected based upon the distance between them; the prob- 
ability is given by a connection kernel which usually decays with distance such that 
connections are predominantly localised. These spatial networks (especially when the 
underlying distribution of points is clustered) have many features that we expect from 
disease networks, although it is unclear if such simple formulations can be truly repre- 
sentative. 
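
A minimal sketch of this construction follows. The uniform placement in the unit square, the exponential form of the connection kernel and the parameter values are all assumptions made for illustration; other kernels and clustered point patterns fit the same template.

```python
import math
import random

def spatial_network(n, scale, seed=None):
    """Place n individuals uniformly at random in the unit square and link
    each pair with probability exp(-distance / scale) (assumed kernel)."""
    rng = random.Random(seed)
    positions = [(rng.random(), rng.random()) for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(positions[i], positions[j])
            if rng.random() < math.exp(-d / scale):
                edges.append((i, j))
    return positions, edges

positions, edges = spatial_network(200, scale=0.05, seed=42)

# Compare the typical length of a link with the typical separation of a pair.
mean_edge_length = sum(math.dist(positions[i], positions[j]) for i, j in edges) / len(edges)
all_pairs = [(i, j) for i in range(200) for j in range(i + 1, 200)]
mean_pair_distance = sum(math.dist(positions[i], positions[j]) for i, j in all_pairs) / len(all_pairs)
print(mean_edge_length < mean_pair_distance)
```

With a short-range kernel such as this one, links are on average much shorter than the separation of a randomly chosen pair, reflecting the predominantly localised connections.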
2.7.4 Exponential Random Graphs



In recent years, there has been growing interest in exponential random graph models 
(ERGMs) for social networks, also called the p* class of models. ERGMs were first
introduced in the early 1980s by Holland and Leinhardt (1981), based on the work of
Besag (1974). More recently, Frank and Strauss studied a subset of those that have the
simple property that the probability of connection between two nodes is independent
of the connection between any other pair of distinct nodes (Frank and Strauss, 1986).
This allows the likelihood of any nodes being connected to be calculated conditional on
the graph having certain network properties. Techniques such as Markov Chain Monte
Carlo can then be used to create a range of plausible networks that agree with a wide
variety of information collected on network structures even if the complete network
is unknown (Handcock and Jones, 2004; Robins et al., 2004). Due to their
simplicity, ERGMs are widely used by statisticians and social network analysts (Robins et al.,
2007). Despite significant advances in recent years (e.g. Goodreau (2007)), ERGMs
still suffer from problems of degeneracy and computational intractability for large
network sizes, which has limited their use in epidemic modelling.



2.8 Expected Network Properties 

Here we have shown that a wide variety of network structures have been measured or 
synthesised to understand the spread of infectious diseases. Clearly, with such a range 
of networks no clear consensus can be drawn on the types of underlying network struc- 
tures that are generally present; in part this is because different studies have focused on 
different infectious diseases and different diseases require different transmission routes. 
However, three factors emerge that are key components of epidemiological networks: 
heterogeneity in the number of contacts such that some individuals are at a higher risk 
of both catching and transmitting infection; clustering of contacts such that groups of 
individuals are often highly interconnected; and some reflection of spatial separation 
such that contacts usually form locally, but occasional long-range connections do occur. 

Three fundamental problems still exist in the study of networks. Firstly, are there 
relatively low-dimensional ways of capturing key aspects of a network's structure? 
What constitutes a key aspect will vary with the problem being studied, but for epidemi- 
ological applications it should be hoped that a universal set of network characteristics 
may emerge. There is then the task of assessing reasonable and realistic ranges for 
these key variables based on values computed for known transmission networks — un- 
fortunately very few transmission networks have been recorded in any degree of detail, 
although modern electronic devices may simplify the process in the future. Secondly, 
there is the related statistical problem of inferring plausible complete networks from 
the partial information collected by methods such as contact tracing. This is equivalent 
to seeking an underlying model for the network connections that is consistent with the 
known partial information, and hence has strong resonance with the more
mechanistically motivated models in section 2.3. Even when the network is fully realised (and
an epidemic observed) there is considerable statistical difficulty in attributing risk to 
particular contact types. Finally, there are the key questions of predicting the dynam- 
ics of infection on any given network — and while for many complex networks direct 
simulation is the only approach, for other simplified networks some analytical traction
can be achieved, which helps to provide more generic insights into which elements of
network structure are most important. These three key areas are discussed below. 



3 Network Properties 



Real networks can exhibit staggering levels of complexity. The challenge faced by re- 
searchers is to try and make sense of these structures and reduce the complexity in a 
meaningful way. In order to make any sense of the complexities present, researchers 
over several decades have defined a large variety of measurable properties that can be 



used to characterise certain key aspects (Albert and Barabási, 2002; Newman, 2003;
Newman et al., 2006). Here we describe the definitions of the most important
characterisations of complex networks (in our view), and outline their impact on disease
transmission models. 



3.1 Components 

In general, networks are not necessarily connected; in other words, not all parts of the
network are reachable from all others. The component to which a node belongs is
that set of nodes that can be reached from it by paths running along edges of the
network. A network is said to have a giant component if a single component contains the
majority of nodes in the network. In directed networks (those in which each edge has
an associated direction) a node has both an in-component, from which the node can be
reached, and an out-component, which can be reached from that node. A strongly connected
component (SCC) is the set of nodes in the network in which each node is reachable 
from every other node in the component. 

The concept of a giant component is central when considering disease propagation 
in networks. The extent of the epidemic is necessarily limited to the number of nodes 
in the component that it begins in, since there are no paths to nodes in other compo- 
nents. In directed networks, in the case of a single initial infected individual, only the 
out-component of that node is at risk from infection. More generally, the strongly con- 
nected component contains those nodes that can be reached from each other. Members 
of the strongly connected component are most at risk from infection imported at a ran- 
dom node, since a single introduction of infection will be able to reach all nodes in the 
component. 
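
These definitions translate directly into a reachability search. The sketch below uses a small hypothetical directed network and, as a convention assumed here, counts a node as a member of its own in- and out-components; the strongly connected component of a node is then the intersection of the two.

```python
from collections import defaultdict

def reachable(adj, start):
    """Set of nodes reachable from `start` by following directed edges in `adj`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def components_of(edges, node):
    """Return (in-component, out-component, strongly connected component) of `node`."""
    forward, backward = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)
    out_comp = reachable(forward, node)   # nodes that `node` can infect
    in_comp = reachable(backward, node)   # nodes that can infect `node`
    return in_comp, out_comp, in_comp & out_comp

# Hypothetical directed network: a cycle a -> b -> c -> a, plus c -> d and e -> a.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]
in_comp, out_comp, scc = components_of(edges, "a")
print(sorted(in_comp))   # ['a', 'b', 'c', 'e']
print(sorted(out_comp))  # ['a', 'b', 'c', 'd']
print(sorted(scc))       # ['a', 'b', 'c']
```

Here an introduction at node a puts d at risk but not e, while an introduction at e puts every other node at risk.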

3.2 Degrees, Distributions and Correlations 

The degree is defined as the number of neighbours that a node has and is most often 
denoted as k. In directed graphs, the degree has two components, the number of
incoming edges $k^{\mathrm{in}}$ (in-degree), and the number of outgoing edges $k^{\mathrm{out}}$ (out-degree).
The degree distribution is defined as the set of probabilities, P(k), that a node chosen 
at random will have degree k. Plotting the distribution of degrees of nodes is one of 
the most basic and important ways of characterising a given network (Figure 2). In ad- 
dition, useful characterisations are obtained by calculating the moments of the degree 
distribution. The n-th moment of P(k) is defined as:

$$\langle k^n \rangle = \sum_k k^n P(k),$$

with the first moment, $\langle k \rangle$, being the average degree, the second, $\langle k^2 \rangle$, allowing us to
calculate the variance $\langle k^2 \rangle - \langle k \rangle^2$, and so on.
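
As a small worked example, the degree distribution and its moments can be computed from a list of node degrees. The degree list below is hypothetical and the function names are chosen for illustration only.

```python
from collections import Counter

def degree_distribution(degrees):
    """P(k): the fraction of nodes that have degree k."""
    n = len(degrees)
    return {k: count / n for k, count in sorted(Counter(degrees).items())}

def moment(p, n):
    """n-th moment of the degree distribution p, i.e. the sum over k of k**n * P(k)."""
    return sum(k ** n * pk for k, pk in p.items())

degrees = [1, 1, 2, 2, 2, 4]              # hypothetical degrees of six nodes
p = degree_distribution(degrees)
mean = moment(p, 1)                       # average degree
variance = moment(p, 2) - mean ** 2       # second moment minus squared mean
print(round(mean, 6), round(variance, 6)) # 2.0 1.0
```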

The degree distribution is one of the most important ways of characterising a net- 
work as it naturally captures the heterogeneity in individuals' potential to become in- 
fected as well as cause further infection. Intuitively, the higher the number of edges a 
node has, the more likely it is to be a neighbour of an already infected node. Also, the 
more neighbours a node has, the more likely it is to cause a large number of onward 
cases. Thus, knowing the form of P(k) is crucial for the understanding of the spread of 
disease. In random networks of the type studied by Erdős and Rényi, P(k) follows a
binomial distribution, which is effectively Poisson in the case of large networks. Most 
real social networks have distributions that are significantly different from the random 
case. 



For the extreme case of P(k) following an unbounded power law and assuming
equal transmission across all edges, Pastor-Satorras and Vespignani (2001) showed that
the classic result of the epidemic threshold from mean field theory (Anderson and May,
1992) breaks down. In real transmission networks, the distribution of degree is often
heavily skewed, and occasionally follows a power law (Liljeros et al., 2001), but is
always bounded, leading to the recovery of the epidemic threshold, but one which is much
lower than expected in evenly mixed populations (Lloyd and May, 2001).



Figure 2: Comparison of random and scale-free networks. (A) Degree distributions for
two classes of networks: scale-free and random networks (horizontal axis: degree, k).
(B) Example random network with 100 nodes and 300 links; all nodes have similar
numbers of links. (C) Example scale-free network with 100 nodes and 300 links; most
nodes have few links, with a few nodes having many links.



The degree distribution provides very useful information on uncorrelated networks 
such as those produced by configuration models. However, real networks are in gen- 
eral correlated with respect to degree; that is, the probability of finding a node with 
given degree, k, is dependent on the degree of the neighbours of that node, k', which 
is captured by the conditional probability P(k' | k). To characterise this behaviour
several measurements have been proposed. The most straightforward, and probably most
useful measure is to consider the average degree of the neighbours of a node:

$$k_{nn,i} = \frac{1}{k_i} \sum_{j \in \mathrm{Nbrs}_i} k_j,$$

where the sum of degrees is made over the neighbours (Nbrs) of i. One can then
calculate the average of $k_{nn}$ over all nodes with degree k, which is a direct measure of
the conditional probability P(k' | k), since

$$k_{nn}(k) = \sum_{k'} k' P(k' \mid k).$$

When $k_{nn}(k)$ increases with k, the network is said to be assortative on the degree, that is,
high-degree nodes have a tendency to link to other high-degree nodes, a behaviour
often observed in social networks. Other types of networks, such as the internet at router
level, show the converse behaviour, i.e., nodes of high degree tend to link to nodes with
low degree (Newman, 2002, 2003).
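
The average neighbour degree as a function of degree can be computed as sketched below. The function name is hypothetical, and a star network is used as the example because its degree correlations are easy to verify by hand.

```python
from collections import defaultdict

def neighbour_degree_by_degree(adj):
    """Average neighbour degree k_nn(k) for each degree k.

    adj: dict mapping each node to the set of its neighbours (undirected).
    """
    degree = {i: len(nbrs) for i, nbrs in adj.items()}
    grouped = defaultdict(list)
    for i, nbrs in adj.items():
        if degree[i] == 0:
            continue                      # isolated nodes have no neighbours
        knn_i = sum(degree[j] for j in nbrs) / degree[i]
        grouped[degree[i]].append(knn_i)
    # Average the per-node values over all nodes sharing the same degree.
    return {k: sum(vals) / len(vals) for k, vals in sorted(grouped.items())}

# Star network: one hub (node 0) linked to four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(neighbour_degree_by_degree(star))   # {1: 4.0, 4: 1.0}
```

In the star, the average neighbour degree falls as k increases, so the network is disassortative by degree.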



Characterising degree correlations is important for understanding disease spread. 
The classic example is the existence of strong correlations in sexual networks which 
were shown to be a key factor in understanding HIV spread (Gupta et al., 1989). More
recently, mean field solutions of the SIS model on networks have shown that both
the speed and extent of an epidemic are dependent on the correlation pattern of the
substrate network (Boguñá and Pastor-Satorras, 2002; Eguíluz and Klemm, 2002).



3.3 Distances 

In a network, the shortest path between two nodes i and j is the path requiring the
smallest number of steps to reach j from i, following edges in the network. There
may be (and often there is) more than one shortest path between a pair of nodes. The
distance between any pair of nodes, $d_{ij}$, is the minimal number of steps required to reach
j from i, that is the number of steps in the shortest path. The average distance, $\langle d \rangle$, is
the mean of the distances between all pairs of nodes and measures the typical distance
between nodes:

$$\langle d \rangle = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij},$$

where N is the number of nodes in the network. The diameter of the network is
defined as the maximum shortest path distance between a pair of nodes in the network,
$\max(d_{ij})$, which measures the most extreme separation of any two nodes in the
network.
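
For unweighted networks these quantities can be obtained by breadth-first search from every node. The sketch below assumes a connected, undirected network stored as a dictionary of neighbour sets, and averages over ordered pairs of distinct nodes; the four-node path used as the example is hypothetical.

```python
from collections import deque

def distances_from(adj, source):
    """Shortest-path distance (in steps) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_distance_and_diameter(adj):
    """Mean distance over ordered pairs of distinct nodes, and the diameter."""
    n = len(adj)
    total, diameter = 0, 0
    for i in adj:
        dist = distances_from(adj, i)
        if len(dist) < n:
            raise ValueError("network is not connected")
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    return total / (n * (n - 1)), diameter

# Path of four nodes: 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
mean_d, diam = average_distance_and_diameter(path)
print(round(mean_d, 3), diam)  # 1.667 3
```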



Characterising networks in terms of the number of steps needed to reach any node 
from any other is also important. Real networks frequently display the small-world 
property, that is, the vast majority of nodes are reachable in a small number of steps. 

This has clear implications for disease spread and its control. Percolation approaches
have shown that the effects of the small world phenomenon can be profound (Moore and Newman,
2000). If it only takes a small number of steps to reach everyone in the population,
diseases are able to spread much more rapidly.



The notion of shortest distance through a network can be used to quantify how
central a given node is in the network. Many measures have been used (Wasserman and Faust,
1994), but the most relevant of these is betweenness centrality. Betweenness captures
the idea that the more shortest paths pass through a node, the more central it is in the
network. So, betweenness is simply defined as the proportion of shortest paths that
pass through a single node:

$$B_i = \frac{\#\,\text{shortest paths through } i}{N(N-1)},$$

where N is the number of nodes in the network, and the denominator quantifies the total 
number of shortest paths in the network. In terms of disease spread, identifying those 
nodes with high betweenness will be important. Central nodes are likely to become
infected early on in the epidemic, and are also key targets for intervention (Bell et al.,
1999).
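
One way to compute betweenness for an unweighted network is the accumulation scheme usually attributed to Brandes, sketched below. Two conventions are assumed here rather than taken from the text: when a pair of nodes has several shortest paths, each path contributes an equal fraction, and paths are not counted as passing through their own end points. The result is divided by N(N - 1) to match the definition above.

```python
from collections import deque

def betweenness(adj):
    """Betweenness of every node, normalised by N(N - 1).

    adj: dict mapping each node to the set of its neighbours (N >= 2).
    """
    n = len(adj)
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording predecessors on those paths.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate path fractions back from the most distant nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: score[v] / (n * (n - 1)) for v in adj}

# Path 0 - 1 - 2: only the middle node lies between another pair.
b = betweenness({0: {1}, 1: {0, 2}, 2: {1}})
print(b[0], round(b[1], 3), b[2])  # 0.0 0.333 0.0
```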



3.4 Clustering 

An important example of an observable property of any network is the clustering
coefficient, $\phi$, a measure of the local density of a graph. In social network terms, this
quantifies the likelihood that the friend of your friend is also your friend. It is defined
as the probability that two neighbours of a node will also be neighbours of each other
and can be expressed as follows:

$$\phi = \frac{3 \times \#\,\text{of triangles in the network}}{\#\,\text{of connected triples}},$$

where a connected triple means a single node with edges to a pair of others. $\phi$ measures
the fraction of triples that also form part of a triangle. The factor of three accounts for
the fact that each triangle is found in three triples and guarantees that $0 \le \phi \le 1$ (and
its inclusion depends on the way that triangles in the network are counted).



Locally, the clustering coefficient for each node, i, can be defined as the fraction of
triangles formed through the immediate neighbours of i (Watts and Strogatz, 1998):

$$\phi_i = \frac{\#\,\text{triangles centered on } i}{\#\,\text{triples centered on } i}.$$

The clustering property of networks is essential to the understanding of transmission
processes. In clustered networks, rapid local depletion of susceptible individuals plays
a hugely important role in the dynamics of spread (Keeling, 1999; Eames and Keeling,
2002); for a more analytic treatment of this, see section 4.2 below.
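
Both the global and the local clustering coefficients can be computed by counting, for each node, how many pairs of its neighbours are themselves linked. In the sketch below a node with fewer than two neighbours is given a local coefficient of zero, which is a convention assumed here; the example network is hypothetical.

```python
from itertools import combinations

def clustering(adj):
    """Return (global clustering coefficient, dict of local coefficients).

    adj: dict mapping each node to the set of its neighbours (undirected).
    """
    closed_total = 0      # closed triples; equals 3 x the number of triangles
    triples_total = 0     # all connected triples
    local = {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        triples = k * (k - 1) // 2                    # triples centred on i
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        triples_total += triples
        closed_total += closed
        local[i] = closed / triples if triples else 0.0
    phi = closed_total / triples_total if triples_total else 0.0
    return phi, local

# A triangle (0, 1, 2) with one extra node (3) attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
phi, local = clustering(adj)
print(phi)        # 0.6
print(local[0])   # 1.0
```

Each triangle is encountered once from each of its three corners, so the running count of closed triples already contains the factor of three that appears in the global definition.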



3.5 Subgraphs 

Degree and clustering characterise some aspects of network structure at an individ- 
ual level. Considering distances between nodes provides information about the global 
organisation of the network. Intermediate scales are also present and characterising 
these can help in our understanding of network structure and therefore the dynamics of 
spread. 



At the simplest level, networks can be thought of as comprising a collection
of subgraphs. The simplest subgraph, the clique, is defined as a group of more than 
two nodes where all the nodes are connected to each other by means of edges in both 
directions. In other words, a clique is a fully connected subgraph, with the smallest 
example being a triangle. This is a strong definition and one which is only fulfilled in 
a limited number of cases, most notably households (see Figure 1D, section 4.2 and
House and Keeling (2008)). n-cliques relax the above constraint, while retaining its
basic premise. The shortest path between all the nodes in a clique is one. Allowing this
distance to take higher values, one arrives at the definition of n-cliques, which are
defined as subgroups of the graph containing more than two nodes where the maximum
shortest path distance between any two nodes in the group is n. Over the years many
variants of these basic ideas have been formalised in the social network literature and
a good summary can be found in Wasserman and Faust (1994).



Considering higher order structures can be very informative but is more involved. 
Milo and co-workers began by looking for specific patterns of connections between 
nodes in small sub-graphs, dubbed motifs. Given a connected sub-graph of size 3 (for 
example), there are 13 possible motifs. Statistically, some of these appear more often 
and are found to be over-represented in certain real networks compared to random
networks (Milo et al., 2002). Understanding the motif composition of a complex network
has been shown to improve the predictive power of deterministic models of
transmission when motifs are explicitly modelled (see section 4.2 and House et al. (2009a)).



In the above definitions, a subgraph has been defined only in reference to itself. A 
different approach is to compare the number of internal edges to the number of external 
edges, arising from the intuitive notion that a community will be denser in terms of 
edges than its surroundings. One such definition, that of a community in the
strong sense, is a subgraph in which each node has more edges to other nodes
within the subgraph than to any other nodes in the network. Again, this definition is
quite restrictive, and in order to relax these constraints, the most commonly used (and 
most intuitive) definition of communities is groups of nodes that have a high density 
of edges within them and a lower density of edges between groups. This intuitive 
definition is behind the most widely used approach for studying community structure 
in networks. Newman and Girv an formalised this in terms of the modularity measure 
Q (INewman and Girvanl 120041) . Given a particular network which is partitioned into 
communities, the modularity measure compares the expected number of edges within 
communities to the actual number of edges within communities. 
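
As an illustration of this comparison, the following Python sketch computes Q in
its commonly quoted form: for each community, the fraction of edges that fall inside
it minus the fraction expected if edges were placed at random while preserving node
degrees. The function name `modularity`, the edge-list input format and the small
example graph are assumptions made for illustration and are not taken from the
works cited.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges:     list of (u, v) pairs, each undirected edge listed once
    community: dict mapping every node to a community label
    """
    m = len(edges)
    degree = {}
    internal = {}  # number of edges with both ends in the same community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1

    degree_sum = {}  # total degree of the nodes in each community
    for node, k in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + k

    # Q = sum_c [ internal_c / m - (degree_sum_c / 2m)^2 ]
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Two triangles joined by a single edge, split into the two obvious groups.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_two_groups = modularity(edges, groups)                 # equals 5/14
q_one_group = modularity(edges, {v: "a" for v in groups})  # equals 0
```

Placing all nodes in a single community always gives Q = 0 in this form, so positive
values indicate more within-community edges than the degree-preserving null model
would predict.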

Although the impact of communities in transmission processes has not been fully 
explored, a few studies have shown that community structure can have a profound impact
on disease dynamics (Buckee et al., 2007; Salathé and Jones, 2010). An alternative
measure of how "well knit" a graph is, named conductance (Kannan et al., 2004) and most
widely used in the computer science literature, has also been found to be important in
a range of networks (Leskovec et al., 2009).



3.6 Higher Dimensional Networks 

All of the above definitions have concentrated on networks where the edges remain 
unchanged over time and all edges have equal weight. Both of these constraints can 
naturally be relaxed, but generally this calls for a higher-dimensional characterisation 
of the edges within the network. It is a matter of common experience that social in- 
teractions which can lead to infection do change, with some contacts being repeated 
regularly, while others are more sporadic. The frequency, intensity and duration of 
contacts are all time-varying. How these inherently dynamic networks are represented 
for the purposes of modelling can have a significant impact on the model outcomes
(Vernon and Keeling, 2009; Kao et al., 2006). However, capturing the structure of such
dynamic networks in a parsimonious manner remains a substantial challenge. More
work has been done on weighted networks, as these are a more straightforward extension
of the classical presence-absence networks (Barrat et al., 2004; Newman, 2004).
In terms of disease spread, the movement networks discussed in section 2.4 are often
considered as weighted (Hufnagel et al., 2004; Viboud et al., 2006a; Robinson et al.,
2007).



In the sections that follow we discuss how these network properties can be inferred 
statistically and the improvements in our understanding of the transmission of infection 
in networks that have come as a result. 



4 Model Formulation 

4.1 Techniques for Simulation 

One of the key advantages of the simulation of disease processes on networks is that 
it enables the study of systems that are too complex for analytical approaches to be 
tractable. With that in mind, it is worth briefly considering efficient approaches to dis- 
ease simulation on networks. 



There are two main types of simulation model for infectious diseases on networks: 
discrete-time and continuous-time models; of these, discrete-time simulations are more 
common, so we discuss them first. In a discrete-time simulation, at every time-step dis- 
ease may be transmitted along every edge from an infectious node to a susceptible node 
with a particular probability (which may be the same for all extant edges, or may vary 
according to properties of the two nodes or the edge). Also, nodes may recover (be- 
coming immune, or reverting to being susceptible) during each time-step. Within a 
time-step, every infection and recovery event is assumed to occur simultaneously. In a 
dynamic network simulation, the network is typically updated every time-step — for 
example, in a livestock movement network, during time-step x, infection could only 
transmit down edges that occurred during time-step x. Clearly, in a directed network, 
infection may only transmit in the direction of an edge. 



Whilst algorithms for discrete-time simulations are not complex, some simple
implementation techniques (arising from the observation that most networks of
epidemiological interest are sparse) can significantly enhance software performance. In a
directed network with N nodes, there are N(N - 1) possible edges; in a sparse network
with mean node degree k, there are $Nk \ll N(N - 1)$ edges. Accordingly, rather than
representing the network as an N by N array, where each element of the array is 0 if
the edge is absent, nonzero otherwise, it is usually more efficient to maintain a list of
the neighbours of each node. Then, if a list of infected nodes is maintained during a
simulation run, it is straightforward to consider each susceptible neighbour of an
infected node in turn and test if infection is transmitted to that node. Additionally, a fast
high-quality pseudo-random number generator such as the Mersenne Twister should
be used (Matsumoto and Nishimura, 1998). The "contagion" software package
implements these techniques (amongst others), and is freely available (Vernon, 2007).
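
The discrete-time scheme and the neighbour-list representation described above can
be sketched in Python as follows. This is a minimal illustration under stated
assumptions, not the implementation used in the "contagion" package: the function
name `discrete_time_sir`, its parameters, and the choice of a static undirected
network with SIR dynamics and a single transmission probability per edge per
time-step are all hypothetical. Node labels are assumed to be sortable (for example
integers). Python's standard `random` module is itself based on the Mersenne Twister.

```python
import random


def discrete_time_sir(neighbours, initial_infected, p_transmit, p_recover,
                      n_steps, seed=None):
    """Discrete-time SIR simulation on a static network.

    neighbours: dict mapping each node to a list of its neighbours
    Returns a list of (S, I, R) counts, one entry per time-step.
    """
    rng = random.Random(seed)  # Mersenne Twister generator
    state = {node: "S" for node in neighbours}
    for node in initial_infected:
        state[node] = "I"
    infected = set(initial_infected)  # only infected nodes are visited
    n_recovered = 0
    history = [(len(state) - len(infected), len(infected), 0)]

    for _ in range(n_steps):
        new_infections = set()
        new_recoveries = set()
        for i in sorted(infected):
            # Test transmission along every edge to a susceptible neighbour.
            for j in neighbours[i]:
                if state[j] == "S" and rng.random() < p_transmit:
                    new_infections.add(j)
            if rng.random() < p_recover:
                new_recoveries.add(i)
        # All events within a time-step are applied simultaneously.
        for j in new_infections:
            state[j] = "I"
        for i in new_recoveries:
            state[i] = "R"
        infected = (infected | new_infections) - new_recoveries
        n_recovered += len(new_recoveries)
        n_susceptible = len(state) - len(infected) - n_recovered
        history.append((n_susceptible, len(infected), n_recovered))
    return history


# A path of four nodes: with certain transmission and recovery the infection
# moves one node along the path per time-step.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
history = discrete_time_sir(path, [0], 1.0, 1.0, n_steps=4, seed=1)
```

Because only the neighbour lists of currently infected nodes are scanned, the work
per time-step scales with the number of edges leaving infected nodes rather than
with N squared.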



The alternative approach to simulating disease processes on networks is to simulate 
a series of stochastic Markovian events (the continuous-time approach). Essentially,
given the state of the system, it is possible to calculate the probability distributions of
when possible subsequent events (i.e. recovery of an infectious node or infection of
a susceptible node) will occur. Random draws from these distributions are then made
to determine which event occurs next, the state of the system is updated, and the process
repeated. This approach was pioneered by Gillespie to study the dynamics of chemical
reactions (Gillespie, 1977); it is, however, computationally intensive, so approximations
have been developed. The $\tau$-leap method (Gillespie, 2001), where multiple events
are allowed to occur during a time period $\tau$, is clearly related to the discrete-time
formulation discussed above. However, the ability to allow $\tau$ to vary during a simulation
to account for the processes involved (Cao et al., 2006) has potential benefits.
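
A minimal Python sketch of the continuous-time (Gillespie-type) approach for SIR
dynamics on a static network is given below. It assumes Markovian transmission
across each susceptible-infectious edge at rate `tau` and Markovian recovery at
rate `gamma`; the function name `gillespie_sir`, its arguments and the naive
recomputation of all rates after every event are illustrative assumptions, chosen
for clarity rather than speed. Node labels are assumed to be sortable.

```python
import random


def gillespie_sir(neighbours, initial_infected, tau, gamma, t_max, seed=None):
    """Event-driven SIR simulation; returns a list of (t, S, I, R) tuples."""
    rng = random.Random(seed)
    state = {node: "S" for node in neighbours}
    for node in initial_infected:
        state[node] = "I"
    n_inf = sum(1 for s in state.values() if s == "I")
    counts = {"S": len(state) - n_inf, "I": n_inf, "R": 0}
    t = 0.0
    trajectory = [(t, counts["S"], counts["I"], counts["R"])]

    while counts["I"] > 0:
        infectious = sorted(n for n, s in state.items() if s == "I")
        si_edges = [(i, j) for i in infectious
                    for j in neighbours[i] if state[j] == "S"]
        recovery_rate = gamma * len(infectious)
        infection_rate = tau * len(si_edges)
        total_rate = recovery_rate + infection_rate
        if total_rate <= 0.0:
            break
        # Waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # Choose the event type in proportion to its share of the total rate.
        if rng.random() * total_rate < recovery_rate:
            node = rng.choice(infectious)
            state[node] = "R"
            counts["I"] -= 1
            counts["R"] += 1
        else:
            _, node = rng.choice(si_edges)
            state[node] = "I"
            counts["S"] -= 1
            counts["I"] += 1
        trajectory.append((t, counts["S"], counts["I"], counts["R"]))
    return trajectory


# Fully connected network of five nodes with one initial infective.
complete = {i: [j for j in range(5) if j != i] for i in range(5)}
trajectory = gillespie_sir(complete, [0], tau=1.0, gamma=1.0, t_max=1e9, seed=1)
```

Rebuilding the list of susceptible-infectious edges at every event costs time
proportional to the number of edges leaving infectious nodes; a production code
would update this list incrementally.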

The continuous-time approach is clearly in closer agreement with the ideal of standard
disease models; however, utilising this method may be computationally prohibitive,
especially when large networks are involved. Discrete-time models may provide a
viable alternative for three main reasons. Firstly, as the time-steps involved in the 
discrete-time model become sufficiently small, we would expect the two models to 
converge. Secondly, inaccuracies due to the discrete-time formulation are likely to be 
less substantial in network models compared to random-mixing models, providing two 
events do not occur in the same neighbourhood during the same time-step. Finally, the 
daily cycle of contacts that regulate most of our lives means that using time-steps of 
less than 24 hours may falsely represent the temporal accuracy that can be attributed to 
any simulation of the real world. 



4.2 Analytic Methods 

In this section we use the word 'analytic' broadly, to imply models that are directly nu- 
merically integrable, without the use of Monte Carlo simulation methods, rather than 
systems for which all results can be written in terms of fundamental functions, of which 
there are very few in epidemiology. Analytic approaches to transmission of infection 
on networks fall into three broad categories. Firstly, there are approaches that calcu- 
late exact invasion thresholds and final sizes for special networks. Secondly, there are 
approaches for calculating exact transient dynamics, including epidemic peak heights 
and times, but again these only hold in special networks. Finally, there are approaches 
based on moment closure that give approximately correct dynamics for a wide class
of networks. 

Before considering these approaches on networks, it is worth considering what is 
meant by non-network mixing, and showing explicitly how this can derive the standard 
transmission terms from familiar differential equation models. Non-network mixing 
can be taken to have one of two meanings: either that every individual in the population 
is weakly connected to every other (the mean-field assumption), or that an Erdős-Rényi
random graph defines the transmission network, depending on context. To see how this
determines the epidemic dynamics, we consider a population of N individuals, with a
homogeneous independent probability q that any pair of individuals is linked on the
network, which gives each individual a mean number of edges $\bar{n} = q(N - 1)$. We
then assume that the transmission rate for infection across an edge is $\tau$ and that the
proportion of the population infectious at a time t is $I(t)$; then the force of infection
experienced by an average susceptible in the population is $\bar{n}\tau I(t) = \beta I(t)$. The
quantity $\beta$ therefore defines a population-level transmission rate that can be interpreted
in one of two ways as $N \to \infty$. In the case where the population is assumed to be fully
connected, the limit is that q is held at unity, and so $\tau$ is reduced to 0 as N is increased
to hold $q(N - 1)\tau$ constant. In the case where the population is connected on a random
graph, q is reduced as N is increased to hold $\bar{n}$ constant.

In either case, having defined an appropriate population-level transmission rate, 
a stochastic susceptible-infectious model of transmission is defined through a Markov 
chain in which a population with X susceptible individuals and Y infectious individuals 
transitions stochastically to a population with X - 1 susceptible individuals and Y + 1 
infectious individuals at rate $\beta XY/(N - 1)$. The exact mean behaviour of such a
system in the limit $N \to \infty$ then has its transmission behaviour captured by

$$\dot{S} = -\beta S(t) I(t), \tag{1}$$

where S, I are the proportion of individuals susceptible and infectious respectively. The
mathematical formalism behind deriving such sets of ordinary differential equations
from Markov chains is given by Kurtz (1970), and a summary of the application of this
methodology to infectious disease modelling is given in Diekmann and Heesterbeek
(2000). However, it should be clear that equation (1) is familiar as the basis of all
random-mixing epidemiological models.



In the case of exponentially-distributed infectious periods and recovery from infec- 
tion offering long-lasting immunity, the standard SIR equations provide an exact de- 
scription of the mean behaviour of this system. Nevertheless, the existence of waning 
immunity, a latent period between an individual becoming infected and being able to 
transmit infection, and non-exponentially distributed recovery periods are also important
for epidemiological applications (Anderson and May, 1992; Keeling and Rohani,
2007; Ross et al., 2010). These can often be incorporated into analytical approaches
through the addition of extra disease compartments, which necessitates extra algebraic 
and computational effort but typically does not require a fundamental conceptual re- 
evaluation. Sometimes significant additional complexity does not even modify quanti- 
tative epidemiological results — for example, regardless of the rate of waning immunity, 
length of latent period, or infectious period distribution, if the mean infectious period 
is T then the basic reproductive ratio is

$$R_0 = \beta T. \tag{2}$$

The estimation of this quantity for complex disease histories, from data likely to be
available, is considered by Wallinga and Lipsitch (2007). We therefore focus on the
transmission process, since this is most affected by network structure, and other ele- 
ments of biological realism typically act at the individual level. An important caveat to 
this, however, is when an infected individual's level of transmissibility varies over the 
course of their infectious period, which sets up correlations between the processes of 
transmission and recovery that pose a particular challenge for analytic work, especially
in structured populations, as noted by e.g. Ball et al. (2009).



4.2.1 Exact Invasion 

For non-network mixing, the threshold for invasion is given by the basic reproductive
ratio $R_0$, defined as the expected number of secondary infectious cases created by
an average primary infectious case in an otherwise wholly susceptible population. In
structured populations, this verbal definition is typically altered to be the secondary
cases caused by a typical primary case once the dynamical system has settled into its
early asymptotic behaviour. As such, the threshold for invasion is $R_0 = 1$: for values
above this an infection can grow in the population and the disease can successfully
invade; for values below it each chain of infection is doomed to eventual extinction.
Values of $R_0$ can be measured directly during the course of an epidemic by detailed
contact tracing; however, there are considerable statistical issues concerning censoring
and data quality.

Provided there are no short closed loops in the network, $R_0$ can be defined through
a next-generation matrix:

$$K_{km} = \frac{[km](m - 1)}{m[m]}\, p, \tag{3}$$

where $K_{km}$ defines the number of cases in individuals with k contacts from an
individual with m contacts during the early stages of the epidemic. Here and elsewhere
in this section we use square brackets to represent the numbers of different types on
the network; hence [m] is the number of individuals with m edges in the network and
[km] is the number of edges between individuals with k and m contacts respectively.
In addition p is the probability of infection eventually passing across the edge between
a susceptible-infectious pair (for Markovian recovery rate $\gamma$ and transmission rate $\tau$
this is given by $p = \tau/(\tau + \gamma)$). The basic reproductive ratio is given by the dominant
eigenvalue of the next-generation matrix:

$$R_0 = \|K\|. \tag{4}$$

This quantity corresponds to the standard verbal definition of the basic reproductive
ratio, and correspondingly the invasion threshold is at $R_0 = 1$.
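
To make the construction concrete, the Python sketch below builds a degree-indexed
next-generation matrix from a neighbour list and estimates its dominant eigenvalue.
Each entry is computed as (number of ordered neighbouring pairs with degrees k and m)
times (m - 1) times p, divided by (m times the number of nodes of degree m). Several
choices here are assumptions made for illustration: the function name
`next_generation_r0`, counting [km] as ordered pairs (so that a regular network of
degree n gives (n - 1)p), and the use of power iteration on the matrix plus the
identity, which avoids oscillation when the matrix is periodic. The formula itself
presumes no short closed loops; the code does not check this.

```python
def next_generation_r0(neighbours, p, n_iter=1000):
    """Return (K, R0) for an undirected network given as a neighbour list.

    K[a][b] is the expected number of cases in nodes of degree degrees[a]
    caused by a typical infected node of degree degrees[b].
    """
    degree = {node: len(adj) for node, adj in neighbours.items()}
    degrees = sorted(k for k in set(degree.values()) if k > 0)
    size = len(degrees)
    if size == 0:
        return [], 0.0
    index = {k: a for a, k in enumerate(degrees)}

    node_count = [0] * size                          # [m]
    for k in degree.values():
        if k > 0:
            node_count[index[k]] += 1
    pair_count = [[0] * size for _ in range(size)]   # [km], ordered pairs
    for i, adj in neighbours.items():
        for j in adj:
            pair_count[index[degree[i]]][index[degree[j]]] += 1

    K = [[pair_count[a][b] * (degrees[b] - 1) * p
          / (degrees[b] * node_count[b])
          for b in range(size)] for a in range(size)]

    # Power iteration on K + I: its dominant eigenvalue is R0 + 1.
    v = [1.0 / size] * size
    growth = 1.0
    for _ in range(n_iter):
        w = [v[a] + sum(K[a][b] * v[b] for b in range(size))
             for a in range(size)]
        growth = sum(w)          # v sums to one, so this estimates R0 + 1
        v = [x / growth for x in w]
    return K, growth - 1.0


# Cube graph: every node has degree 3, so R0 should equal 2p.
cube = {i: [i ^ 1, i ^ 2, i ^ 4] for i in range(8)}
K_cube, r0_cube = next_generation_r0(cube, p=0.3)
```

For a network with a single degree class the matrix is 1 by 1 and the result
reduces to (n - 1)p, reflecting the fact that one edge of each infected node was
used to infect it.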

Once an appreciable number of short closed loops are present in the network, exact
threshold parameters can still sometimes be defined, but these typically depart from the
standard verbal definition of $R_0$. For example, Ball et al. (2009) consider a branching
process on cliques (households) connected to each other through configuration-model
edges, so that cliques are connected to each other at random (Figure 1D). By considering
the number of secondary cliques infected by a clique with one initial infected individual, a
threshold called $R_*$ can be defined. (For the configuration model of households where
each household is of the same size and each individual has the same number of random
connections outside the household, the threshold $R_*$ is given later as equation (13);
however, the methodology is far more general.) The calculation of the invasion threshold
for the recently defined Triangular Configuration Model (Miller, 2009; Newman, 2009)
involves calculating both the expected number of secondary infectious individuals and
triangles rather than just working at the individual level. Trapman (2007) deals with
how these sorts of results can be related to more general networks through bounding.
A general feature of clustered networks for which exact thresholds have been derived
so far is that there is a local-global distinction in transmission routes, with a general
theory of this given by Ball and Neal (2002), where 'overlapping groups' and 'great
circle' models are also analysed. Nevertheless, care still has to be taken over which
threshold parameters are mathematically well behaved and easily calculated (e.g. Pellis
et al., 2009).



4.2.2 Exact Final Size



The most sophisticated and general way to obtain exact results for the expected final
size of a major outbreak on a network is called the susceptibility set argument and the
most general version is currently given by Ball et al. (2009). We give an example of
these kinds of arguments from Diekmann et al. (1998), who consider the simpler case
of a network in which each individual has n contacts. Where there is a probability p of
infection passing across a given network link (so for transmission and recovery at rates
$\tau$ and $\gamma$ respectively, $p = \tau/(\tau + \gamma)$) the probability that an individual avoids
infection is given by

$$S_\infty = (1 - p + \tilde{S} p)^n, \qquad \tilde{S} = (1 - p + \tilde{S} p)^{n-1}. \tag{5}$$

Here, a two-step process is needed because in an unclustered, regular graph two
generations of infection are needed to stabilise the network correlations and so the auxiliary
variable $\tilde{S}$ must also be solved for. Once this and $S_\infty$ are known, the expected attack
rate is $R_\infty = 1 - S_\infty$.
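
The two-step calculation can be carried out numerically by fixed-point iteration,
as in the Python sketch below. It assumes the regular-network relations in the form
S_tilde = (1 - p + p S_tilde)^(n-1) and S_inf = (1 - p + p S_tilde)^n, and starts the
iteration from S_tilde = 0 so that it converges to the smallest root (S_tilde = 1 is
always a root and corresponds to no major outbreak). The function name
`regular_final_size` and the tolerance settings are illustrative assumptions.

```python
def regular_final_size(n, p, tol=1e-12, max_iter=100000):
    """Expected attack rate on an unclustered n-regular network."""
    s_tilde = 0.0
    for _ in range(max_iter):
        # Auxiliary variable: uses n - 1 links, since one link leads back
        # towards the individual under consideration.
        updated = (1.0 - p + p * s_tilde) ** (n - 1)
        converged = abs(updated - s_tilde) < tol
        s_tilde = updated
        if converged:
            break
    # Probability of escaping infection along all n links.
    s_inf = (1.0 - p + p * s_tilde) ** n
    return 1.0 - s_inf


# n = 3 and p = 3/4: the auxiliary equation has roots 1 and 1/9, giving
# S_inf = (1/3)^3 and an attack rate of 26/27.
attack_rate = regular_final_size(3, 0.75)
```

Below the threshold (n - 1)p = 1 the iteration converges to S_tilde = 1 and the
attack rate is zero, consistent with the invasion condition for a regular network.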



4.2.3 Approximate Final Size 

The main way to calculate approximate final sizes is given by percolation-based
methods. These were reviewed by Bansal et al. (2007) and also in Newman (2010). Suppose
we remove a fraction $\varphi$ of links from the network, and can derive an expression for the
fraction of nodes remaining in the giant component of the network, $f(\varphi)$. Then

$$R_\infty \approx f(1 - p), \tag{6}$$

and an invasion threshold is given by the value of p for which this final size becomes
non-zero in the 'thermodynamic limit' of very large network size. This approach is
not exact for clustered graphs, but for unclustered graphs exact results like (5) are
reproduced.



4.2.4 Exact Dynamics 

Some of the earliest work on infectious diseases involved the exact solution of master 
equations (where the probability of the population being in each possible configuration
is calculated) on small, fully connected graphs as summarised in Bailey (1975). The
rate at which the complexity of the system of master equations grows means that these
equations quickly become too complex to integrate for the most general network. The
presence of symmetries in the network, however, does mean that automorphism-driven
lumping is one way to manipulate the master equations (whilst preserving the full
stochastic information about the system) for solution (Simon et al., 2010). At present,
this technique has only been applied to relatively simple networks; however, there are
no other highly general methods of deriving exact lower-dimensional systems of equa- 
tions from the master equations. 



Nevertheless, other specific routes do exist that allow exact systems of equations
of lower dimensionality to be derived for special networks. For static networks
constructed using the configuration model (where individuals have heterogeneous degree
but connections are made at random such that the presence of short loops can be
ignored in a large network, see Figure 1C), an exact system of equations for SIR dynamics
in the limit of large network size was provided by Ball and Neal (2008). This construction
involves attributing to each node an 'effective degree', which starts the epidemic
at its actual degree, and measures connections still available as routes of infection and
is therefore reduced by transmission and recovery. Using notation consistent with
elsewhere in this paper (and ignoring the global infection terms that were included by Ball
and co-workers) this yields the relatively parsimonious set of equations:

$$
\begin{aligned}
\dot{S}_k &= -\rho\left((\tau + \gamma) k S_k - \gamma (k+1) S_{k+1}\right), \\
\dot{I}_k &= \tau\left((k+1) I_{k+1} - k I_k\right) - \gamma I_k
  + \rho\left[(k+1)\left(\tau (S_{k+1} + I_{k+1}) + \gamma I_{k+1}\right) - k(\tau + \gamma) I_k\right], \\
\rho &:= \frac{\sum_k k I_k}{\sum_k k (S_k + I_k)}.
\end{aligned}
\tag{7}
$$

Here $S_k$, $I_k$ are the proportion of effective degree k susceptible and infectious
individuals respectively. Hence for a configuration-network where the maximum degree is K
we require just 2K equations to retrieve the exact dynamics.

While $R_0$ can be derived using expressions like (3), calculation of the asymptotic
early growth rate r requires systems of ODEs like (7). If we assume that transmission
and recovery are Markovian processes with rates $\tau$ and $\gamma$ respectively, two measures of
early behaviour are

$$R_0^{CM} = \frac{\tau}{\tau + \gamma}\,\frac{\langle n(n-1)\rangle}{\langle n\rangle}, \qquad
r^{CM} = \tau\,\frac{\langle n(n-2)\rangle}{\langle n\rangle} - \gamma, \tag{8}$$

where $\langle\cdot\rangle$ denotes the average over the degree distribution. These quantities
tell us that the susceptibility to invasion of a network increases with both the mean and
the variance of the degree distribution. This closely echoes the results for risk-structured
models (Anderson and May, 1992) but with an extra term of $-1$ due to the network,
representing the fact that the route through which an individual acquired infection is
closed off for future transmission events.
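
The dependence on the mean and variance of the degree distribution can be checked
with a few lines of Python. The sketch below evaluates the two configuration-model
quantities, R0 = tau/(tau + gamma) times <n(n-1)>/<n> and r = tau <n(n-2)>/<n> - gamma,
for a degree distribution supplied as a dictionary of probabilities; the function
name `configuration_model_invasion` and the example distributions are assumptions
made for illustration.

```python
def configuration_model_invasion(degree_dist, tau, gamma):
    """Return (R0, r) for a configuration-model network.

    degree_dist: dict mapping degree n to the probability of that degree
    tau, gamma:  Markovian transmission and recovery rates
    """
    mean_n = sum(n * q for n, q in degree_dist.items())
    mean_n_n1 = sum(n * (n - 1) * q for n, q in degree_dist.items())
    mean_n_n2 = sum(n * (n - 2) * q for n, q in degree_dist.items())
    r0 = tau / (tau + gamma) * mean_n_n1 / mean_n
    growth_rate = tau * mean_n_n2 / mean_n - gamma
    return r0, growth_rate


# Two networks with the same mean degree (3) but different variance.
regular = configuration_model_invasion({3: 1.0}, tau=1.0, gamma=1.0)
mixed = configuration_model_invasion({1: 0.5, 5: 0.5}, tau=1.0, gamma=1.0)
```

With these rates the regular network sits exactly at threshold (R0 = 1, r = 0),
whereas the heterogeneous network with the same mean degree lies above it, which
illustrates the role of degree variance. The two conditions R0 = 1 and r = 0 always
coincide, since both reduce to tau <n(n-2)> = gamma <n>.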



For more structured networks with a local-global distinction, there are two limits in
which exact dynamics can also be derived. If the network is composed of m communities
of size $n_1, \ldots, n_m$, with the between-community (global) mixing determined by
a Poisson process with rate $\bar{n}_G$ and the within-community (local) mixing determined
by a Poisson process with rate $\bar{n}_L$, then in the limit as the communities become large,
$n_a \to \infty$, the epidemic dynamics on the system are

$$
\begin{aligned}
\dot{S}_a &= -S_a\Big(\beta_L I_a + \alpha \sum_{b \neq a} I_b\Big), \\
\dot{I}_a &= S_a\Big(\beta_L I_a + \alpha \sum_{b \neq a} I_b\Big) - \gamma I_a,
\end{aligned}
\tag{9}
$$

where $S_a$ and $I_a$ are the proportion of individuals susceptible and infectious in
community a, and

$$\alpha = \frac{\bar{n}_G \tau}{(m - 1)}, \qquad \beta_L = \bar{n}_L \tau. \tag{10}$$

Hence, we have a classic metapopulation model (Hanski and Gaggiotti, 2004), defined
in terms of Poisson local and global connections and large local community sizes.

In the limit where $\bar{n}_L \to (n - 1)$ and $m \to \infty$, such that there are infinitely many
communities of equal size and each community forms a fully interconnected clique,
'self-consistent' equations such as in Ghoshal et al. (2004) and House and Keeling
(2008) are exact. These equations evolve the proportion of cliques with x susceptibles
and y infecteds, $P_{x,y}$, as well as the proportion of infecteds in the population, I, as
follows:

$$
\begin{aligned}
I &= \frac{1}{n} \sum_{x,y} y P_{x,y}, \\
\dot{P}_{x,y} &= \gamma\left(-y P_{x,y} + (y+1) P_{x,y+1}\right) \\
 &\quad + \tau\left(-xy P_{x,y} + (x+1)(y-1) P_{x+1,y-1}\right) \\
 &\quad + \beta_G I\left(-x P_{x,y} + (x+1) P_{x+1,y-1}\right),
\end{aligned}
\tag{11}
$$

where $\beta_G = \bar{n}_G \tau$.



Both of these two local-global models, the metapopulation model (9) and the small
cliques model (11), are reasonably numerically tractable for modern computational
resources, provided the relevant finite number (m or n respectively) is not too large.
The basic reproduction number for the first system is clearly

$$R_0 = \frac{1}{\gamma}\left(\beta_L + (m - 1)\alpha\right), \tag{12}$$

while for the second, household model, invasion is determined by

$$R_* = \frac{\beta_G}{\tau + \gamma}\, Z_\infty^n(\tau, \gamma), \tag{13}$$

where $Z_\infty^n(\tau, \gamma)$ is the expected final size of an epidemic in a household of size n with
one initial infected. Of course, the within- and between-community mixing for real
networks is likely to be much more complex than may be captured by a Poisson process,
but these two extremes can provide useful insights. These models show that network
structure of the form of communities reduces the potential for an infectious disease to
spread, and hence greater transmission rates are required for the disease to exceed the
invasion threshold.



4.2.5 Approximate Dynamics 

While all the exact results above are an important guide to intuition they only hold for 
very specialised networks. A large class of models exists that form a bridge between 
'mean-field' models and simulation by using spatial or network moment closure equa- 
tions. These are highly versatile models. In general, invasion thresholds and final sizes 
can be calculated rigorously, but exact calculation of transient dynamics is only pos- 
sible for very special networks. If one wants to calculate transient effects in general 
network models — most importantly, peak heights and times — then moment closure is 
really the only versatile way of calculating desired quantities without relying on full 
numerical simulation. 

It is also worth noting that there are many results derived through these 'approximate'
approaches that are the same as exact results, or are numerically indistinguishable
from exact results and simulation. We give some examples below, and also note
that the dynamical PGF approach (Volz, 2008) is numerically indistinguishable from
the exact model (7) above for certain parameter values (Lindquist et al., 2010). What
is currently lacking is a rigorous mathematical proof of exactness for ODE models
other than those outlined in section 4.2.4 above. While for many practical purposes
the absence of such a proof will not matter, we preserve here the conceptual distinction 
between results that are provably exact, and those that are numerically exact in all cases 
tested so far. 

The idea of moment closure is to start with an exact but unclosed set of equations 
for the time evolution of different units of structure. Here we show how these can 
be derived by considering the rates of change of both types of individual and types 
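of connected pair. Such pairwise moment closure models are a natural extension to
the standard (random-mixing) models, given that infection is passed between pairs of
infected individuals:

$$
\begin{aligned}
\frac{d}{dt}[S_\kappa] &= -\tau [S_\kappa \leftarrow I], \\
\frac{d}{dt}[I_\kappa] &= \tau [S_\kappa \leftarrow I] - \gamma [I_\kappa], \\
\frac{d}{dt}[S_\kappa S_\lambda] &= -\tau\left([S_\kappa S_\lambda \leftarrow I] + [I \rightarrow S_\kappa S_\lambda]\right), \\
\frac{d}{dt}[S_\kappa I_\lambda] &= \tau\left([S_\kappa S_\lambda \leftarrow I] - [I \rightarrow S_\kappa I_\lambda] - [S_\kappa \leftarrow I_\lambda]\right) - \gamma [S_\kappa I_\lambda], \\
\frac{d}{dt}[I_\kappa I_\lambda] &= \tau\left([I \rightarrow S_\kappa I_\lambda] + [I \rightarrow S_\lambda I_\kappa] + [S_\kappa \leftarrow I_\lambda] + [S_\lambda \leftarrow I_\kappa]\right) - 2\gamma [I_\kappa I_\lambda], \\
\frac{d}{dt}[S_\kappa R_\lambda] &= -\tau [I \rightarrow S_\kappa R_\lambda] + \gamma [S_\kappa I_\lambda], \\
\frac{d}{dt}[I_\kappa R_\lambda] &= \tau [I \rightarrow S_\kappa R_\lambda] + \gamma\left([I_\kappa I_\lambda] - [I_\kappa R_\lambda]\right).
\end{aligned}
\tag{14}
$$

Here we use square brackets to represent the prevalence of different species within the
network. We also use some non-standard notation to present several diverse approaches
in a unified framework: generalised indices $\kappa$, $\lambda$ represent any property of a node (such
as its degree); while arrows represent the direction of infection (and so for a directed
network, the necessity that an edge in the appropriate direction be present).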

Clearly, the system (14) is not closed as it relies on the number of connected triples,
and so some form of approximate closure must be introduced to relate the triples to
pairs and nodes, which will depend on underlying properties of the network. Most
commonly, these closure assumptions deal with heterogeneity in node degree,
assortativity, and clustering at the level of triangles. Examples include Keeling (1999) and
Eames and Keeling (2002), where the generalised variables $\kappa$, $\lambda$ above stand for node
degrees (k, l), the triple closure is symmetric with respect to the direction of infection,
and the network is assumed to be static and non-directed. A general way to write the
closure assumption is:

$$[A_k B_l C_m] \approx \frac{l - 1}{l}\left((1 - \phi)\frac{[A_k B_l][B_l C_m]}{[B_l]}
  + \phi\,\frac{\bar{n} N}{km}\,\frac{[A_k B_l][B_l C_m][C_m A_k]}{[A_k][B_l][C_m]}\right), \tag{15}$$

where $\bar{n} = \langle n \rangle$ is again the average of the degree distribution, and $\phi$ measures the
ratio of triangles to triples as a means of capturing clustering within the network (see
section 3.4). The typical way to analyse the closed system is direct numerical integration;
however, some analytic traction can be gained. One example is the use of a linearising
Ansatz to derive the early asymptotic behaviour of the dynamical system. Interestingly,
when this is done for $\phi = 0$ (such that there are no triangular loops in the network) as
in Eames and Keeling (2002), the result for the early asymptotic growth rate agrees
with the exact result of equation (8). In Keeling (1999), the differential equations for
an n-regular graph were also manipulated to give an expression for final size that agreed
with the exact result (5).
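
As a concrete instance of a closed pairwise system, the Python sketch below
integrates the simplest case: SIR dynamics on an undirected n-regular network with
no clustering, where every triple is closed as [ABC] = ((n - 1)/n) [AB][BC]/[B].
All quantities are expressed per node (N = 1), pairs are counted in both directions,
and a small randomly placed initial infection is assumed. The function name
`pairwise_sir_regular`, the fourth-order Runge-Kutta integrator and the default
step size are illustrative assumptions rather than choices made in the cited papers.

```python
def pairwise_sir_regular(n, tau, gamma, i0=1e-4, dt=0.005, t_max=40.0):
    """Final proportion recovered from the closed pairwise SIR model."""

    def deriv(y):
        s, i, ss, si = y
        ssi = (n - 1) / n * ss * si / s   # closed [SSI] triple
        isi = (n - 1) / n * si * si / s   # closed [ISI] triple
        return (-tau * si,
                tau * si - gamma * i,
                -2.0 * tau * ssi,
                tau * (ssi - isi - si) - gamma * si)

    def shifted(y, k, h):
        return tuple(a + h * b for a, b in zip(y, k))

    s0 = 1.0 - i0
    # Randomly placed initial infection: [SS] = n S^2 and [SI] = n S I.
    y = (s0, i0, n * s0 * s0, n * s0 * i0)
    for _ in range(int(round(t_max / dt))):
        k1 = deriv(y)
        k2 = deriv(shifted(y, k1, 0.5 * dt))
        k3 = deriv(shifted(y, k2, 0.5 * dt))
        k4 = deriv(shifted(y, k3, dt))
        y = tuple(a + dt / 6.0 * (b + 2.0 * c + 2.0 * d + e)
                  for a, b, c, d, e in zip(y, k1, k2, k3, k4))
    s, i = y[0], y[1]
    return 1.0 - s - i


# n = 3 with tau = 3 and gamma = 1, so that p = tau / (tau + gamma) = 3/4.
final_recovered = pairwise_sir_regular(3, tau=3.0, gamma=1.0)
```

Linearising the [SI] equation about the disease-free state gives an early growth
rate of tau (n - 2) - gamma, which is the regular-network case of the
configuration-model growth rate. For the parameters shown, the final proportion
recovered can be compared with the value 26/27 obtained from the regular-network
final size relations with n = 3 and p = 3/4.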



Equation (15), however, is not the only possible network moment closure regime:
Boots and Sasaki (2002) and Bauch (2005) considered regimes in which closure
depended on the disease state (i.e. triples composed of different arrangements of
susceptibles and infecteds close differently) to deal with spatial lattice-based systems and
early disease invasion respectively. For example, Boots and Sasaki (2002) use a closure
where

$$
\begin{aligned}
[SOO] &\approx \epsilon\,\frac{n-1}{n}\,\frac{[SO][OO]}{[O]}, \\
[IOO] &\approx \epsilon\,\frac{n-1}{n}\,\frac{[IO][OO]}{[O]}, \\
[SOS] &\approx \frac{n-1}{n}\,[OS]\left(1 - \epsilon\frac{[OO]}{[O]} - \frac{[IO]}{[O]}\right), \\
[ABC] &\approx \frac{n-1}{n}\,\frac{[AB][BC]}{[B]} \quad \text{for all other triples},
\end{aligned}
\tag{16}
$$

where O represents empty sites within the network that are not currently occupied by
individuals, and the parameter $\epsilon = 0.8093$ accounts for the clustering within
lattice-based networks. House and Keeling (2010b) considered a model of infection
transmission and contact tracing on a network, where the closure scheme for [ABC] triples was
asymmetric in A and C; this allowed the natural conservation of quantities in a highly
clustered system.



The work on dynamical PGF models (Volz, 2008) can be seen as an elegant
simplification of this pairwise approach that is valid for SIR-type infection dynamics on
configuration model networks. The equations can be reformulated as:

$$
\begin{aligned}
S &= g(\theta), \\
\dot{I} &= \tau p_I \theta g'(\theta) - \gamma I, \\
\dot{p}_I &= \tau p_S p_I \theta \frac{g''(\theta)}{g'(\theta)} - \tau p_I (1 - p_I) - \gamma p_I, \\
\dot{p}_S &= \tau p_S p_I \left(1 - \theta\frac{g''(\theta)}{g'(\theta)}\right), \\
\dot{\theta} &= -\tau p_I \theta,
\end{aligned}
\tag{17}
$$

where g is the probability generating function for the degree distribution, $p_S$ and $p_I$
correspond to the number of contacts of a susceptible that are susceptible or infected
respectively, and $\theta$ is defined as the probability that a link randomly selected from the
entire network has not been associated with the transmission of infection. Here the closure
assumption is implicit in the definition of S: that an individual only remains susceptible
if all of its links have not seen the transmission of infection and that the probability
is independent for each link, which is comparable to the assumptions underlying the
formulation by Ball and Neal (2008), equation (7). The precise link between this PGF
formulation and the pairwise approach is discussed more fully in House and Keeling
(2010a).



There are many other extensions of this general methodology that are possible. 
Writing ODEs for the time evolution of triples and closing at a higher order allows the
consideration of the epidemiological consequences of varying motif structure (House et al.,
2009b). Sharkey et al. (2006) considered closure at triple level on directed networks,
which involved a more sophisticated treatment of third-order clustering due to the
larger repertoire of three-motifs in directed (as compared to undirected) networks. It
is also possible to combine stochastic and network moment closure (Dangerfield et al.,
2008). Time-varying, dynamical networks, particularly applied to sexually transmitted
infections where partnerships vary over the course of an epidemic, were considered
using approximate ODE-based models by Eames (2004) and Volz and Meyers (2007).
Sharkey (2008) considered models appropriate for local networks with large shortest
path lengths, where the generic indices $\kappa$, $\lambda$ in (14) stand for node numbers i, j rather
than node degrees k, l.



Another approach is to approximate the transmission dynamics in the standard 
(mean-field) differential equations models. Essentially this is a form of moment
closure at the level of pairs rather than triples. For example, in Roy and Pascual (2006)
the transmission rate takes the polynomial form:

$$\text{Transmission rate to } S_k \text{ from } I_l \propto kl\,(S_k)^p (I_l)^q, \tag{18}$$

where the exponents, p and q, are typically fitted to simulated data but are thought
to capture the spatial arrangement of susceptible and infected nodes. Also, Kiss et al.
(2006a) suggest:

$$\text{Transmission rate to } S_k \text{ from } I_l \propto k(l - 1)(S_k)(I_l), \tag{19}$$

as a way of accounting for each infected 'losing' an edge to its infectious parent.

Finally, very recent work (Volz, 2010) presents a dynamical system to capture epidemic
dynamics on triangular configuration model networks; the relationship between
this and other ODE approaches is likely to be an active topic for future work. 

This diversity of approaches leads to some important points about methods based 
on moment closure. These methods are extremely general, and can be applied to con- 
sider almost any aspect of network structure or disease natural history; they can be 
applied to populations not currently amenable to direct simulation due to their size; 
and they do not require a complete description of the network to run — only certain sta- 
tistical properties. However, there are currently no general methods for the proposal 
of appropriate closure regimes, nor any derivation of the limits on dynamical biases 
introduced by closure. Therefore, closure methods sit somewhere in between exact re- 
sults for highly specialised kinds of network and stochastic simulation where intuitive 
understanding and general analysis are more difficult. 



4.3 Comparison of analytic models with simulation 



In the papers that introduced them, the differential-equation based approximate dynam- 
ical systems above were compared to stochastic simulations on appropriate networks. 
Two recent papers making a comparison of different dynamical systems with simulation
are Bansal et al. (2007) and Lindquist et al. (2010). There are, however, several
issues with attempts to compare deterministic models with simulation and also with 
each other. 

Firstly, it is necessary to define what is meant by agreement between a smooth,
deterministic epidemic curve and the rough trajectories produced by simulation. Lim- 
iting results about the exactness of different ODE models assume that both the number 
of individuals infectious and the network size are large, and so the early behaviour of 
simulations, when there are few infectious individuals, is often dominated by stochas- 
tic effects. There are different ways to address this issue, but even after this has been 
done there are two sources of deviation of simulations from their deterministic limit. 
The first of these is the number of simulations realised. If there is a summary statis- 
tic such as the mean number of infectious individuals over time, then the confidence 
interval in such a statistic can be made arbitrarily small by running additional simula- 
tions, but agreement between the deterministic limit and a given realisation may still 
be poor. The second source of deviation is the network size. By increasing the number
of nodes the prediction interval within which the infection curve will fall can be made
arbitrarily small; however, the computational resources needed to simulate extremely
large networks can quickly become overwhelming.

More generally, each approximate model is designed with a different application in 
mind. Models that perform well in one context will often perform poorly in another, 
and this means that 'performance' of a given model in terms of agreement with simula- 
tion will primarily be determined by the discrete network system on which simulations 
are performed. 

The above considerations motivate the example comparisons with simulation that
we show in Figure 3. This collection of plots is intended to show a variety of different
example networks, and the dynamical systems intended to capture their behaviour.

In the first five plots of Figure 3, continuous-time simulations have their temporal
origin shifted so that they agree on the time at which a cumulative incidence of
200 is reached, and then confidence intervals in the mean prevalence of infection
are obtained through bootstrapping. The 95% confidence interval is shown as a red
shaded region (although typically this is sufficiently narrow that it resembles a line). Six
different deterministic models are compared to simulations: HomPW is the pairwise
model of Keeling (1999) with zero clustering; HetPW is the heterogeneous pairwise
model of Eames and Keeling (2002); ClustPW is the improved clustered pairwise closure
of House and Keeling (2010b); PGF is the model of Volz (2008); Pair-based is
the model of Sharkey (2008), integrated using the supplementary code from Sharkey
(2010); and Degree-based is the model of Pastor-Satorras and Vespignani (2001).
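
The alignment and bootstrapping steps can be sketched as follows. This is an illustrative outline rather than the code used to produce the figure: the function names are hypothetical, and it assumes each realisation has already been interpolated onto a common (shifted) time grid.

```python
import numpy as np


def shift_to_threshold(times, cum_incidence, threshold=200):
    """Shift the time axis so t = 0 is when cumulative incidence first reaches threshold."""
    times = np.asarray(times, dtype=float)
    cum_incidence = np.asarray(cum_incidence)
    idx = int(np.argmax(cum_incidence >= threshold))  # first index at or above threshold
    if cum_incidence[idx] < threshold:
        raise ValueError("threshold never reached in this realisation")
    return times - times[idx]


def bootstrap_mean_ci(curves, n_boot=1000, level=0.95, seed=0):
    """Bootstrap confidence interval for the mean prevalence curve.

    curves: array of shape (n_runs, n_timepoints), one row per realisation,
            all on the same shifted time grid (assumption).
    """
    rng = np.random.default_rng(seed)
    curves = np.asarray(curves, dtype=float)
    n_runs = curves.shape[0]
    # Resample realisations with replacement and average each resample.
    idx = rng.integers(0, n_runs, size=(n_boot, n_runs))
    boot_means = curves[idx].mean(axis=1)  # shape (n_boot, n_timepoints)
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(boot_means, alpha, axis=0)
    upper = np.quantile(boot_means, 1.0 - alpha, axis=0)
    return curves.mean(axis=0), lower, upper
```

The interval returned here describes uncertainty in the mean curve only; it narrows as more realisations are added and says nothing about how far a single realisation may fall from the deterministic limit.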

Plot A shows a heterogeneous network composed of two risk groups, constructed
according to the configuration model (Molloy and Reed, 1995). In this case, models
that incorporate heterogeneity like HetPW and PGF (which are numerically indistinguishable
in this case and several others) are in very close agreement with simulation,
while just taking the average degree as in HomPW is a poor choice. In B, assortativity
is added to the two-group model following the approach of Newman (2002), and
HetPW outperforms PGF. Plots C and D show regular graphs with four links per node,
but while C is static, in D the rate of making and breaking links is much faster than the
epidemic process. Models like HomPW and PGF are therefore better for the former and
Degree-based models are better for the latter. In reality the ratio of the rate of network
change to the rate of transmission may not be either large or small and so a more
sophisticated method may be best (Eames, 2004; Volz and Meyers, 2007). E shows a graph
with four links per node where clustering has been introduced by the rewiring method
of Bansal et al. (2009), sometimes called the 'big V' (House and Keeling, 2010b). In
this case ClustPW performs better than HomPW and PGF, but clearly there is significant
inaccuracy around the region of peak prevalence and so this model captures qualitatively
the effects of clustering without appearing to be exact for this precise network.
Finally, Plot F considers the case of a one-dimensional next-nearest-neighbour lattice
(so there are four links per node). This introduces long path lengths between nodes
in addition to clustering, meaning that the system does not converge onto a period of
asymptotic early growth and so realisations are shown as a density plot rather than a
confidence interval. ClustPW accounts for clustering, but not long path lengths and so
is in poor agreement with simulation while the Pair-based curve captures the qualitative
behaviour of an epidemic on this lattice whilst being quantitatively a reasonable
approximation.

Figure 3: Comparison of simulation and deterministic models for six networks.
(Recoverable panel titles: A) Two-group configuration model network; B) Two-group
assortative network.)
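
The lattice of Plot F is simple enough to construct directly, which makes its clustering easy to check. The sketch below is illustrative: the function names are hypothetical, and periodic boundary conditions (a ring) are assumed, which the text does not specify.

```python
import numpy as np


def ring_lattice(n, reach=2):
    """Adjacency matrix of a 1-D periodic lattice.

    Each node is linked to all nodes up to `reach` steps away on either side,
    so reach=2 gives the next-nearest-neighbour lattice with four links per node.
    Periodic boundaries are an assumption made for this sketch.
    """
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for d in range(1, reach + 1):
            j = (i + d) % n
            adj[i, j] = adj[j, i] = 1
    return adj


def clustering(adj):
    """Proportion of connected triples that are closed into triangles."""
    a = np.asarray(adj)
    deg = a.sum(axis=1)
    closed = np.trace(a @ a @ a)         # each triangle is counted six times
    triples = (deg * (deg - 1)).sum()    # each connected triple is counted twice
    return closed / triples
```

For this lattice three of the six pairs of neighbours of any node are themselves linked, so the clustering is one half, far higher than in a random graph of the same degree.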

5 Inference on Networks



In order to be predictive, epidemic models rely on valid values for parameters govern- 
ing outbreak dynamics, conditional on the population structure. However, obtaining 
these parameters is complicated by the fact that, even when knowing the underlying 
contact network structure, infection events are censored — it is only when disease is 
detected either from symptoms or laboratory tests that a case becomes apparent. In 
attempting to surmount this difficulty, parameter estimates are often obtained by mak- 
ing strong assumptions as to the infectious period, or through ad-hoc methods with 
unknown certainty. Measuring the uncertainty in such estimates is as important as ob- 
taining the estimates themselves in providing an honest risk prediction. Given these 
difficulties, inference for epidemic processes has perhaps received little attention in 
comparison to its simulation counterpart. 



The presence of contact network data for populations provides a unique opportunity
to estimate the importance of various modes of disease transmission from disease
incidence or contact tracing data. For example, given knowledge of the rate of contact
between two individuals, it is possible to infer the probability that a contact results in
an infection. If data on mere connectivity (i.e. a 1 if the individuals are connected, and
0 otherwise) is available, then it is still possible to infer a rate of infection between
connected individuals. Thus the detail of the inference is determined to a large extent
by the available detail in the network data (Jewell et al., 2009a).



5.1 Availability of Data 

Epidemic models are defined in terms of times of transitions between infection states, 
for example a progression from susceptible, to infected, to removed (i.e. recovered 
with lifelong immunity or dead) in the so-called 'SIR' model. Statistical inference re- 
quires firstly that observations of the disease process are made: at the very least this 
comprises the times of case detections, remembering that infection times are always 
censored (you only ever know you have a cold a few days after you caught it). In addi- 
tion, covariate data on the individuals provides structure to the population, and begins 
to enable the statistician to make statements about the importance of individuals' re- 
lationships to one another in terms of disease transmission. Therefore, any covariate 
data, however slight, effectively implies a network structure upon which disease trans- 
mission can be superimposed. 

As long as populations are relatively small (e.g. populations of farms in livestock
disease analysis), it is common for models to operate at the individual level, providing
detailed information on case detection times and perhaps even information on
epidemiologically significant historical contact events (Jewell et al., 2009b,a; Keeling et al.,
2001). In other populations, however, such detailed data may not be available for
practical and ethical reasons. Instead, data is supplied on an aggregated spatial
and/or temporal basis. For the purposes of inference, therefore, this can be regarded as
a household model, with areas constituting households.

In a heterogeneous population, the behaviour of an epidemic within any particular 
locality is governed by the relationship between infected and susceptible individuals. 
For inference in the early stages of an epidemic, it is important to quantify the amount 
of uncertainty in the underlying contact networks as the early growth of the epidemic 
is known to be sub-exponential due to the depletion of the local susceptible population.
This contrasts markedly with the exponential growth observed in a large homogeneously
mixing population (Anderson and May, 1992). When the network is known
and details of individual infections are available, contact tracing data may be used to
infer the network; this data could also be used for inference on the epidemic parameters
(Wallinga et al., 2006; Wallinga and Lipsitch, 2007). Conversely, if the network is
completely unknown, it would be useful if estimation of both the epidemic parameters
and parameters specifying the structure of the network was possible. This is a difficult
problem because the observed epidemic contains very limited information about the
underlying network, as demonstrated by Britton and O'Neill (2002). However, with
appropriate assumptions some results can be obtained; the limited amount of existing
work in this area is described in section 5.3.1 below, although clearly the problem is
worthy of further study.

5.1.1 Inference on Homogeneous Models 

For homogeneous models the basic reproduction number, or $R_0$, has several equivalent
definitions and can be defined in terms of the transmission rate $\beta$ and removal rate
$\gamma$. For non-homogeneous models the definitions are not equivalent; see for example
Pellis et al. (2009).



Although inference for $\beta$ and $\gamma$ is difficult for real applications (see below), it turns
out that making inference on $R_0$ (as a function of $\beta$ and $\gamma$) is rather more
straightforward. Heffernan et al. (2005) summarise various methods for estimating $R_0$ from
epidemiological data based on endemic equilibrium, average age at infection, epidemic
final size, and intrinsic growth rate (Mollison, 1995; Diekmann and Heesterbeek, 2000;
Hethcote, 2000). However, these methods all rely on observing a complete epidemic,
and hence for real-time analysis during an epidemic we must make strong assumptions
concerning the number of currently undetected infections. An example of inference for
$R_0$ based upon complete epidemic data is provided by Stegeman et al. (2004), where
data from the 2003 outbreak of High Pathogenicity Avian Influenza H7N7 is fitted to a
chain-binomial model using a generalised linear model.
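
Three of these routes to the basic reproduction number have simple closed forms in the homogeneous SIR model. The sketch below uses standard textbook relations rather than anything specific to the papers cited above; the function names are hypothetical, the final-size relation assumes an initially fully susceptible population, and the growth-rate relation assumes an exponentially distributed infectious period.

```python
import math


def r0_from_rates(beta, gamma):
    """R0 as the ratio of transmission rate to removal rate."""
    return beta / gamma


def r0_from_final_size(z):
    """R0 from the fraction z ultimately infected (0 < z < 1).

    Uses the final-size relation 1 - z = exp(-R0 * z), which assumes a
    homogeneously mixing, initially fully susceptible population.
    """
    return -math.log(1.0 - z) / z


def r0_from_growth_rate(r, gamma):
    """R0 from the early exponential growth rate r.

    Uses r = beta - gamma, i.e. assumes an exponentially distributed
    infectious period with mean 1 / gamma.
    """
    return 1.0 + r / gamma
```

None of these relations carries over unchanged to structured populations, which is one reason the different definitions cease to be equivalent there.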

Obviously, complete or near-complete epidemic data is rare and hence it is desirable
to perform inference based upon partial observation. This is particularly relevant for
real-time estimation of $R_0$. For example, Cauchemez et al. (2006) attempt to estimate
$R_0$ in real time by constructing a discrete-time statistical model that imputes the number
of secondary cases generated by each primary case. This is based on the method
of Wallinga and Teunis (2004), who formulate a likelihood function for inferring who
infected whom from dates of symptom onset

\[ L(i \text{ infected } j) = \frac{w(t_j - t_i)}{\sum_{k \neq j} w(t_j - t_k)} \]

where $w(\cdot)$ is the probability density function for the generation interval $t_j - t_i$, i.e.
the time between infector $i$'s infection time and infectee $j$'s infection time. Of course,
infection times are never observed in practice so symptom onset times are used as a
proxy, with the assumption that the distribution of infection time to symptom onset
time is the same for every individual. Bayesian methods are used to infer "late-onset"
cases from known "early-onset" cases, but large uncertainty of course remains when
inferring the reproductive ratio close to the current time as there exists large uncertainty
about the number of cases detected in the near future. Additionally, a model for $w(\cdot)$
must be chosen (see Lipsitch et al., 2003).
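
The who-infected-whom calculation is compact enough to sketch directly. The code below is an illustration of the relative-likelihood idea only, not the implementation of the cited authors: the function name is hypothetical, the generation interval density is supplied by the caller, and only cases with strictly earlier onset are treated as candidate infectors (an assumption).

```python
import numpy as np


def who_infected_whom(onset_times, w):
    """Relative likelihoods P[i, j] that case i infected case j.

    onset_times: symptom onset times, used as a proxy for infection times.
    w: vectorised density of the generation interval, evaluated for x >= 0.
    Each column with at least one candidate infector is normalised to sum to 1.
    """
    t = np.asarray(onset_times, dtype=float)
    diff = t[None, :] - t[:, None]  # diff[i, j] = t_j - t_i
    # Only strictly earlier cases can be infectors (assumption of this sketch).
    weights = np.where(diff > 0, w(np.maximum(diff, 0.0)), 0.0)
    col_sums = weights.sum(axis=0)
    return np.divide(weights, col_sums,
                     out=np.zeros_like(weights), where=col_sums > 0)


def case_reproduction_numbers(p):
    """Expected number of secondary cases attributed to each case."""
    return p.sum(axis=1)
```

Summing a row gives the expected number of secondary cases for that case. For the most recent cases this is biased downwards because their infectees have not yet been observed, which is the source of the real-time uncertainty noted above.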



The trade-off in the simplicity of estimating $R_0$ in these ways, however, is that although
a population-wide $R_0$ gives a measure of whether an epidemic is under control
on a wide scale, it gives no indication as to regional-level, or even individual-level, risk.
Moreover, the two examples quoted above do not even attempt to include population
heterogeneity in their models, though the requirement for its inclusion is difficult to
ascertain in the absence of model diagnostics results. It is postulated, therefore, that a
single measure of $R_0$, although simple to obtain, is not sufficient in order to make tactical
control-policy decisions. In these situations, knowledge of both the transmission
rate and removal rate is required.



5.2 Inference on Household Models 

Inference for household models is well developed in comparison to inference for other
'network' models. In essence this is for three main reasons: firstly, it is a reasonable 
initial approximation to assume that infection either occurs within the home or from 
a random source in the population. Secondly, entire households can be serologically 
sampled following an epidemic, such that the distribution of cases in households of 
given sizes can be ascertained. Finally, it is often a reasonable approximation that 
following introduction of infection into the household, the within-household epidemic 
will go extinct before any further introductions — which dramatically simplifies the 
mathematics. 



The first methods proposed for such inference are maximum likelihood procedures
based upon chain-binomial models, such as the Reed-Frost model, or the stochastic
formulation of the Kermack-McKendrick model considered by Bartlett (1949). These
early methods are summarised by Bailey (1975). They, and the significant majority
of methods proposed for household inference to date, use final-size data which can be
readily obtained from household serology results. A simplifying assumption to facilitate
inference in most methods is that the epidemics within the various households
evolve independently (e.g. see the martingale method of Becker, which requires
the duration of a latent period to be substantial for practical implementation).

Additionally, fixed probabilities $p_C$ and $p_H$, corresponding to a susceptible individual
escaping community-acquired infection during the epidemic and escaping infection
when exposed to a single infected household member, respectively, were initially
assumed (Longini and Koopman, 1982). Two important, realistic extensions
to this framework are to incorporate different levels of risk factors for individuals
(Longini et al., 1988) and to introduce dependence of $p_H$ on an infectious period
(O'Neill et al., 2000). The latter inclusion was enabled by appealing to results of
Ball et al. (1997). These types of methods are largely based upon the ability to generate
closed form formulae for the final size distribution of the models.
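
The simplest such closed form, for fixed community and within-household escape probabilities, can be written as a short recursion. The sketch below follows the recursion usually attributed to Longini and Koopman; the function and argument names are hypothetical, and every household member is assumed to be initially susceptible.

```python
from math import comb


def final_size_distribution(k, p_c, p_h):
    """Probabilities that j = 0..k members of a household of size k are infected.

    p_c: probability of escaping community-acquired infection over the epidemic.
    p_h: probability of escaping infection from a single infected household member.
    All k members are assumed initially susceptible (assumption of this sketch).
    """
    m_diag = []  # m_diag[j] = P(all j members infected in a household of size j)
    probs = []
    for size in range(k + 1):
        # j infected as if in a household of size j, while the remaining
        # (size - j) members escape the community and all j infectives.
        probs = [
            comb(size, j) * m_diag[j] * p_c ** (size - j) * p_h ** (j * (size - j))
            for j in range(size)
        ]
        m_diag.append(1.0 - sum(probs))  # everyone infected: the remaining mass
        probs.append(m_diag[-1])
    return probs
```

Given counts of households by size and number infected, these probabilities define a multinomial likelihood that can be maximised over the two escape probabilities.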



The ability to relax assumptions further has been predominantly due to use of
Markov chain Monte Carlo (MCMC) methods as first considered by O'Neill et al.
(2000) for household models, following earlier studies of Gibson and Renshaw (1998)
and O'Neill and Roberts (1999) who focused on single, large outbreaks. This methodology
has been used in combination with simulation and data augmentation approaches
to tailor inference methods for specific data sets of interest (e.g. Neal and Roberts
(2004) consider a model with a spatial component of distance between households and
data containing details of dates of symptoms and appearance of rash), and has also
resulted in a growing number of novel methods for inference (e.g. Clancy and O'Neill
(2007) consider a rejection sampling procedure and Cauchemez et al. (2008) introduce
a constrained simulation approach). Even greater realism can be captured within household
models by considering the different compositions of households and therefore the
weighted nature of contacts within households. For example, Cauchemez et al. (2004)
considered household data from the Epigrippe study of influenza in France 1999-2000,
and showed that children play a key role in the transmission of influenza and the risk
of bringing infection into the household.

Whilst new developments are appearing at an increasing rate, the significant ma- 
jority of methods are based upon final size data and are developed for SIR disease 
models, perhaps due in part to the simplification of arguments for deriving final size 
distributions. One key but still unanswered question from these analyses of household
epidemics is how the transmission rate between any two individuals in the household
scales with the total number of individuals in the household (compare Longini and
Koopman (1982) and Cauchemez et al. (2004)). Intuition would suggest that in larger households



the mixing between any two individuals is decreased, but the precise form of this scal- 
ing is still unclear and much more data on large household sizes is required to provide 
a definitive answer. 



5.2.1 Inference on Fully Heterogeneous populations 

Perhaps the holy grail of statistical inference on epidemics is to make use of an
individual-level model to describe heterogeneous populations at the limit of granularity. In this
respect, Bayesian inference on stochastic mechanistic models using MCMC has perhaps
shown the most promise, allowing inference to be made on transmission
parameters while using data augmentation to estimate the infectious period.

An analysis of the 1861 outbreak of measles in Hagelloch by Neal and Roberts
(2004) demonstrates the use of a reversible jump MCMC algorithm to infer disease
transmission parameters and infectious period, whilst additionally allowing formal
comparisons to be made between several nested models. With the uncertainty surrounding
model choice, such methodology is vital to enable accurate understanding and
prediction. This approach has since been combined with the algorithm of O'Neill and Roberts
(1999) and used to analyse disease outbreaks such as avian influenza and foot and
mouth disease in livestock populations (Jewell et al., 2009b,c; Chis Ster and Ferguson,
2007), and MRSA outbreaks in hospital wards (Kypraios et al., 2010).



Whilst representing the cutting edge of inference on infectious disease processes, 
these approaches are currently limited by computing power, with their algorithms scal- 
ing by the number of infectives multiplied by the number of susceptibles. However, 
with advances in computer technology expected at an increasing rate, and small ap- 
proximations made in the calculation of the statistical likelihoods needed in the MCMC 
algorithms, these techniques may well form the mainstay of epidemic inference in the 
future. 

5.3 Inference From Contact Tracing



In livestock diseases, part of the standard response to a case detection is to gather con- 
tact tracing information from the farmer. The resulting data are a list of contacts that 
have been made in and out of the infected farm during a stipulated period prior to the
notification of disease (Defra, 2007). In terms of disease control on a local level, this
has the aim of identifying both the source of infection and any presumed susceptibles 
that might have been infected as a result of the contact. It has been shown that, provid- 
ing the efficiency of following up any contacts to look for signs of disease is high, this 
is a highly effective method of slowing the spread of an epidemic, and finally contain- 
ing it. 



Much has been written on how contact tracing may be used to decrease the time
between infection and detection (notification) during epidemics. However, this focuses
on the theoretical aspects of how contact tracing efficiency is related to both epidemic
dynamics and population structure (see for example Eames and Keeling (2003);
Kiss et al. (2005); Klinkenberg et al. (2006)). In contrast, the use of contact tracing
data in inferring epidemic dynamics does not appear to have been well exploited, although
it was used by the Ministry of Agriculture, Fisheries, and Food (now Defra) to
directly infer a spatial risk kernel for foot and mouth disease in 2001. This assumed
that the source of infection was correctly identified by the field investigators, thereby
giving an empirical estimate of the probability of infection as a function of distance
(Ferguson et al., 2001; Keeling et al., 2001; Savill et al., 2006). Strikingly, this shows
a high degree of similarity to spatial kernel estimates based on the statistical techniques
of Diggle (2006) and Kypraios (2007) without using contact tracing information. However,
Cauchemez et al. (2006) make the point that the analysis of imperfect contact
tracing data requires more complex statistical approaches, although they abandoned
contact tracing information altogether in their analysis of the 2003 SARS epidemic in
China. Nevertheless, recent unpublished work has shown promise in assimilating imperfect
contact tracing data and case detection times to greatly improve inference, and
hence the predictive capability of simulation techniques.



5.3.1 Inference From Distributions Over Families of Networks 

Qualitative results from simulations indicate that epidemics on networks, for some
parameter values, show features that distinguish them from homogeneous models. The
principal features are a slow-growth phase of very variable length, followed by a rapid
increase in the infection rate and a slower decline after the peak (Keeling, 2005). However,
in quantitative terms there is usually very limited information about the underlying
network and parameters are often not identifiable. When the details of the network are
unknown, but something is known or assumed about its formation, estimation of both
the epidemic parameters and parameters for the network itself is in principle possible
using MCMC techniques. All the stochastic models for generating networks described
above realise a distribution over all or some of the $2^{N(N-1)/2}$ possible networks.
In most cases this distribution is not tractable; MCMC techniques are in principle
still possible but in practice would be too slow without careful design of algorithms.



However, with appropriate assumptions some results can be obtained, which provide
some insight into what more could be achieved. When the network is taken to
be an Erdős-Rényi graph with unknown parameter p and the epidemic is a Markovian
SIR, Britton and O'Neill (2002) showed that it is possible to estimate the parameters,
although they highlight the ever-present challenge of disentangling epidemiological
from network parameters. The MCMC algorithm was improved by Neal and Roberts
(2005) and the extension from SIR to SEIR has been developed by Groendyke et al.
(2010). However, the extension to more realistic families of networks remains a challenging
problem, and will undoubtedly be the subject of exciting future research.
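
To give a sense of the scale of the space over which such inference must integrate, the sketch below counts the undirected simple graphs on N labelled nodes and draws one Erdős-Rényi network. It is purely illustrative: the function names are hypothetical and it is not the algorithm of any of the papers cited above.

```python
import numpy as np


def n_possible_networks(n):
    """Number of undirected simple graphs on n labelled nodes: 2^(n(n-1)/2)."""
    return 2 ** (n * (n - 1) // 2)


def sample_erdos_renyi(n, p, seed=None):
    """Adjacency matrix in which each possible edge is present independently with probability p."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p, k=1)  # strictly upper triangle only
    return (upper | upper.T).astype(int)          # symmetrise, zero diagonal
```

Even for N = 30 there are 2^435 possible networks, so an MCMC scheme over the network must rely on local proposals, such as adding or deleting single edges, rather than enumeration.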



6 Discussion 

The use of networks is clearly a rapidly growing field in epidemiology. By assessing 
(and quantifying) the potential transmission routes between individuals in a population, 
researchers are able to both better understand the observed distribution of infection as 
well as create better predictive models of future prevalence. We have shown how many 
of the structural features in commonly-used contact networks can be quantified and 
how there is an increasing understanding of how such features influence the propaga- 
tion of infection. However, a variety of challenges remain. 



6.1 Open Questions 

Several open problems remain if networks are to continue to influence predictive epi- 
demiology. The majority of these stem from the difficulty in obtaining realistic trans- 
mission networks for a range of pathogens. Although some work has been done to 
elucidate the interconnected structure of sexual encounters (and hence the sexual trans- 
mission network), these are still relatively small-scale compared to the population size 
and suffer from a range of potential biases. Determining comparable networks for 
airborne infections is a far greater challenge, due to the less precise definition of a po- 
tential contact. 



One practical issue is therefore whether new techniques can be developed that allow
contact networks to be assessed remotely. Proximity loggers, such as those used
by Hamede and colleagues (Hamede et al., 2009), provide one potential avenue,
although it would require the technology to become sufficiently robust, portable and
cheap that a very large proportion of a population could be convinced to carry one at 
all times. For many human populations, where the use of mobile phones (which can 
detect each other via Bluetooth) is sufficiently widespread, there is the potential to use 
them to gather network information — although the challenges of developing suffi- 
ciently generic software should not be underestimated. While these remotely sensed 
networks would provide unparalleled information that could be obtained with the min- 
imum of effort, there would still be some uncertainty surrounding the nature of each 
contact. 



There is now a growing set of diary-based studies that have attempted to record the
personal contacts of a large number of individuals; of these POLYMOD is currently
the most comprehensive (Mossong et al., 2008). While such egocentric data obviously
provides extensive information on individual behaviour, due to the anonymity of such
surveys it is not clear how the alters should be connected together. The configuration
method of randomly connecting half-links provides one potential solution, but what is
ideally required is a more comprehensive method that would allow clustering, spatially
localised connections and assortativity between degree distributions to be included and
specified.
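
The random connection of half-links can be sketched in a few lines. This is a minimal illustration with a hypothetical function name; it makes no attempt to remove self-loops or repeated edges, which the basic construction can produce.

```python
import random


def configuration_model(degrees, seed=None):
    """Pair half-links uniformly at random.

    degrees: reported number of contacts for each individual.
    Returns a list of edges (u, v); self-loops and repeated edges may occur
    and are left in place in this sketch.
    """
    if sum(degrees) % 2:
        raise ValueError("the sum of degrees must be even")
    rng = random.Random(seed)
    # One entry per half-link, labelled by the node that owns it.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive entries of the shuffled list are joined into edges.
    return list(zip(stubs[0::2], stubs[1::2]))
```

By construction the resulting network reproduces the reported degrees exactly but has no clustering, spatial structure or degree assortativity beyond what arises by chance, which is the limitation noted above.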



Associated with the desire to have realistic contact networks for entire populations, 
comes the need to characterise such networks in a relatively parsimonious manner that 
provides important insights into the types of epidemiological dynamics that could be re- 
alised. Such a characterisation would allow for different networks (from different times 
or different locations) to be compared in a manner that is epidemiologically significant; 
and would allow artificial networks to be created that matched particular known network
features. This clearly relies on both existing measures of network structure (as
outlined in section 3) together with a robust understanding of how such features influence
the transient epidemic dynamics (as outlined in section 4.2). However, such a generic
understanding of all network features is unlikely to arise for many years. A more immediate
challenge is to understand ways in which local network structure (clustering,
cliques and spatially-localised connections) influences the epidemiological dynamics.

To date the vast majority of the work into disease transmission on networks has fo- 
cused on static networks where all links are of equal strength and therefore associated 
with the same basic rate of transmission. However, it is clear that contact networks 
change over time (both on the short-time scale of who we meet each day, and on the 
longer time-scale of who our main work and social contacts are), and that links have 
different weights (such that some contacts are much more likely to lead to the trans- 
mission of infection than others). While the simulation of infection on such weighted 
time-varying networks is feasible, it is unclear how the existing sets of network proper- 
ties or the existing literature of analytical approaches can be extended to such higher- 
dimensional networks. 

For any methodology to have any substantive use in the field, it is important both to 
have effective data gathering protocols in place, and to have the statistical techniques in 
place to analyse it. Here, three issues are perhaps most critical. Firstly, data gathering 
resources are almost always limited. Therefore, carefully designed randomised sam- 
pling schemata should be employed to maximise the power of the statistical techniques 
used to analyse data, rather than having to rely on data augmentation techniques to
work around the problems present in ad-hoc datasets. This aspect is particularly im- 
portant when working on network data derived from population samples. Secondly, 
any inference on both network and infectious disease models should be backed up by 
a careful analysis of model fit. Although recent advances in statistical epidemiology 
have given us an unprecedented ability to measure population/disease dynamics based 
on readily available field data, epidemic model diagnostics are currently in their in- 
fancy in comparison to techniques in other areas of statistics. Therefore it is expected 
that, with the growth in popularity of network models for analysing disease spread, 
much research effort will be required in designing such methodology. 



6.2 Conclusions 

We have highlighted that the study of contact networks is fundamentally important 
to epidemiology and provides a wealth of tools for understanding and predicting the 
spread of a range of pathogens. As we have outlined above many challenges still exist, 
but with growing interest in this highly interdisciplinary field and ever increasing so- 
phistication in the mathematical, statistical and remote-sensing tools being used, these 
problems may soon be overcome. We conclude therefore that now is an exciting time
for research into network epidemiology as many of the practical difficulties are sur- 
mounted and theoretical concepts are translated into results of applied importance in 
infection control and public health. 

7 Notation

Concept/Measure             Other common names          Our notation   Other common notation
--------------------------  --------------------------  -------------  ---------------------
Network                     Graph                       G
Node                        Vertex, point, site, actor  n              V
Edge                        Link, tie, bond             l              e
Adjacency matrix            Connectivity matrix         G_ij           a_ij, A_ij
Number of nodes             Size of network             N              n, S
Number of edges             Graph size                  L              e, l
Centrality                                              C
Degree                      Connectivity                k              d, C_d
Betweenness                                             B_i            bet_i, C_b
Degree distribution         Connectivity distribution   P(k)           p_k, P_k
Shortest path distance      Geodesic distance           D_ij           d_ij
Clustering                  Transitivity                φ              c, C
Number of nodes of type A                               [A]            n_A, N_A
Number of A-B pairs                                     [AB]           n_AB, N_AB
Diameter                    Maximal shortest path       Diam(G)        max(D_ij)